Abstract. Let $G = (G, +)$ be an additive group. The sumset theory of
Plünnecke and Ruzsa gives several relations between the size of sumsets $A + B$
of finite sets $A, B$, and related objects such as iterated sumsets $kA$ and
difference sets $A - B$, while the inverse sumset theory of Freiman, Ruzsa, and
others characterises those finite sets $A$ for which $A + A$ is small. In this paper
we establish analogous results in which the finite set $A \subset G$ is replaced by
a discrete random variable $X$ taking values in $G$, and the cardinality $|A|$ is
replaced by the Shannon entropy $H(X)$. In particular, we classify the random
variables $X$ which have small doubling in the sense that $H(X_1 + X_2) = H(X) + O(1)$
when $X_1, X_2$ are independent copies of $X$, by showing that they factorise
as $X = U + Z$ where $U$ is uniformly distributed on a coset progression of
bounded rank, and $H(Z) = O(1)$.

When $G$ is torsion-free, we also establish the sharp lower bound
$H(X + X) \ge H(X) + \frac{1}{2} \log 2 - o(1)$, where $o(1)$ goes to zero as $H(X) \to \infty$.



1. Introduction 

The purpose of this paper is to establish analogues of the Plünnecke-Ruzsa-Freiman
sumset and inverse sumset theory for finite subsets of discrete additive
groups, in the setting of discrete random variables in such groups.

1.1. Sumset and inverse sumset theory: a quick review. To motivate our 
results we begin by recalling some of the key results in sumset and inverse sumset 
theory. Let G = (G, +) be an additive group. For any finite non-empty sets A, B 
in G, we define the sumset 

    $A + B := \{ a + b : a \in A, b \in B \}$

and difference set

    $A - B := \{ a - b : a \in A, b \in B \}$

and the iterated sumsets $2A = A + A$, $3A = A + A + A$, etc. We use $|A|$ to denote
the cardinality of a finite set $A$.
We have the trivial bounds 

(1)    $|A|, |B| \le |A + B| \le |A| |B|$

and similarly for $A - B$. In particular, we see that the doubling constant

    $\sigma[A] := |A + A| / |A|$

is at least one. It is easy to see that this doubling constant is precisely one if and
only if $A$ is the translate of a finite subgroup of $G$. Intuitively, one thus expects that
if the doubling constant of $A$ is bounded, then $A$ should in some sense behave like
a translate of a finite subgroup; this is one of the main objectives of the
Plünnecke-Ruzsa sumset theory. One is furthermore interested in classifying those sets $A$ of
small doubling constant; this is the main objective of the Freiman-Ruzsa inverse
sumset theory.
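
As a small illustration (not part of the original text), the following Python sketch computes sumsets and doubling constants for finite sets of integers. The function names `sumset` and `doubling_constant` are our own, and we assume $G = \mathbf{Z}$ for simplicity.

```python
def sumset(A, B):
    """Return the sumset A + B of two finite sets of integers."""
    return {a + b for a in A for b in B}


def doubling_constant(A):
    """Return |A + A| / |A| for a non-empty finite set A of integers."""
    return len(sumset(A, A)) / len(A)


# An arithmetic progression has small doubling: |A + A| = 2|A| - 1.
progression = set(range(10))

# Powers of two: all sums a + b with a <= b are distinct, so |A + A| = |A|(|A| + 1)/2.
geometric = {2 ** i for i in range(10)}

print(doubling_constant(progression))
print(doubling_constant(geometric))

# The trivial bounds (1) for a pair of sets.
mixed = sumset(progression, geometric)
print(max(len(progression), len(geometric)) <= len(mixed) <= len(progression) * len(geometric))
```

The progression sits near the lower end of the range allowed by (1), while the geometric set is close to the upper end.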

We now give some sample results in this theory. One of the simplest is the Ruzsa 
triangle inequality 

(2)    $|A - C| \le \frac{|A - B| |B - C|}{|B|}$,

valid for all non-empty finite subsets $A, B, C$ of $G$ (see e.g. [9], [12, Lemma 2.6]).
In a similar spirit, one has

(3)    $|A - B| \le \frac{|A + B|^3}{|A| |B|}$

(see e.g. [5], [12, Corollary 2.12]). If $\sigma[A] \le K$, then one has the Plünnecke-Ruzsa
inequalities

(4)    $|nA - mA| \le K^{n+m} |A|$

for all $n, m \ge 1$ (see e.g. [9], [12, Corollary 6.28]). We refer the reader to [9] or [12]
for further details of these and related estimates.

Another basic result is the Balog-Szemeredi-Gowers lemma [5], [5], which in- 
volves partial sumsets 

    $A \stackrel{E}{+} B := \{ a + b : (a, b) \in E \}$

for any subset $E$ of $A \times B$:

Lemma 1.2 (Balog-Szemerédi-Gowers lemma). Suppose that $A, B$ are non-empty
finite subsets of an additive group $G$, and let $E \subset A \times B$ be such that $|E| \ge |A||B|/K$
and $|A \stackrel{E}{+} B| \le K |A|^{1/2} |B|^{1/2}$ for some $K \ge 1$. Then there exist subsets
$A' \subset A$, $B' \subset B$ with $|A'| \gg |A|/K$, $|B'| \gg |B|/K$ such that
$|A' + B'| \ll K^7 |A'|^{1/2} |B|^{1/2}$.

Here and in the sequel, we use $X \ll Y$ or $X = O(Y)$ to denote the estimate
$|X| \le CY$ for some absolute constant $C$, and $X \sim Y$ as shorthand for $X \ll Y \ll X$.
If we need the implied constant $C$ to depend on a parameter, we will indicate this
by subscripts, thus for instance $O_K(1)$ denotes a quantity bounded in magnitude
by $C_K$ for some $C_K$ depending only on $K$.

Proof. See [12, Theorem 2.29]. □

Now we turn to inverse theorems. A basic concept here is that of a coset pro- 
gression, which unifies the concept of a coset and of an arithmetic progression. 

Definition 1.3 (Coset progression). [4] A coset progression in an additive group $G$ is
any set of the form $H + P$, where $H$ is a finite subgroup of $G$, and $P$ is a generalised
arithmetic progression, i.e. a set of the form

    $P := \{ x + n_1 r_1 + \ldots + n_d r_d : n_1 \in [0, N_1), \ldots, n_d \in [0, N_d) \}$

where $d \ge 0$ is an integer, $x, r_1, \ldots, r_d$ lie in $G$, $N_1, \ldots, N_d \ge 1$ are integers, and
$[0, N) := \{0, \ldots, N - 1\}$. We call $d$ the rank of the progression. We say that the
coset progression is $t$-proper for some $t > 0$ if the sums $h + x + n_1 r_1 + \ldots + n_d r_d$
for $h \in H$ and $n_i \in [0, tN_i)$ are distinct, and proper if it is 1-proper.

It is easy to see that a coset progression of rank $d$ has doubling constant at
most $2^d$. More generally, if $A$ is a subset of a coset progression $H + P$ with
$|A| \ge |H + P|/K$, then $A$ has doubling constant at most $2^d K$. The following Freiman-type
theorem, first proven in [4], establishes a partial converse to this claim:

Theorem 1.4 (Green-Ruzsa Freiman theorem). Let $G$ be an additive group, and
let $A \subset G$ be a finite non-empty set with $\sigma[A] \le K$ for some $K \ge 1$. Then there
exists a coset progression $H + P$ of rank $O(K)$ and size $|H + P| \le \exp(O(K^{O(1)})) |A|$
such that $A \subset H + P$.

Proof. See [12, Theorem 5.44]. □
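
To make the definition concrete, here is a hedged Python sketch (our own illustration, not from the paper) that builds a generalised arithmetic progression in $G = \mathbf{Z}$, where the only finite subgroup is $H = \{0\}$, tests properness, and checks the doubling bound $2^d$. The helper names `gap`, `is_proper` and `doubling_constant` are hypothetical.

```python
from itertools import product


def gap(x, steps, lengths):
    """Return {x + n_1 r_1 + ... + n_d r_d : 0 <= n_i < N_i} as a set of integers."""
    return {
        x + sum(n * r for n, r in zip(ns, steps))
        for ns in product(*(range(N) for N in lengths))
    }


def is_proper(x, steps, lengths):
    """A progression is proper when all N_1 * ... * N_d sums are distinct."""
    expected_size = 1
    for N in lengths:
        expected_size *= N
    return len(gap(x, steps, lengths)) == expected_size


def doubling_constant(A):
    """Return |A + A| / |A| for a non-empty finite set A of integers."""
    return len({a + b for a in A for b in A}) / len(A)


# A proper progression of rank d = 2: base point 3, steps 1 and 100.
P = gap(x=3, steps=(1, 100), lengths=(10, 7))
print(len(P), is_proper(3, (1, 100), (10, 7)))

# Its doubling constant is at most 2^d = 4.
print(doubling_constant(P) <= 2 ** 2)

# Steps 1 and 2 overlap heavily, so this progression is not proper.
print(is_proper(0, (1, 2), (10, 7)))
```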

1.5. Shannon entropy. We now turn to the concept of Shannon entropy. 

Definition 1.6 (Shannon entropy). Let $A$ be a (discrete) set. Let $\mathrm{Pr}_c(A)$ denote
the set of all probability measures on $A$ with compact (i.e. finite) support, or
equivalently the set of functions $p : A \to [0,1]$ which are non-zero for only finitely many
values, and add up to one. Define an $A$-random variable to be a random variable
$X$ taking values in a finite subset $\mathrm{range}(X) := \{x \in A : P(X = x) \ne 0\}$ of $A$, thus the
distribution function $p_X(x) := P(X = x)$ of $X$ lies in $\mathrm{Pr}_c(A)$. We write $X \equiv Y$
if $p_X = p_Y$, i.e. if $X, Y$ have the same distribution. We refer to random variables
taking values in a finite set as discrete random variables.

The Shannon entropy $H(p)$ of a probability distribution $p \in \mathrm{Pr}_c(A)$ is given by
the formula

(5)    $H(p) := \sum_{x \in A} F(p(x))$

where $F : \mathbf{R}^+ \to \mathbf{R}^+$ is the function

(6)    $F(x) := x \log \frac{1}{x}$

with the convention that $F(0) = 0$. Given an $A$-random variable $X$, we then define
$H(X) := H(p_X)$.

The basic theory of Shannon entropy is reviewed in Appendix A. For now, we
just remark that

(7)    $0 \le H(X) \le \log |\mathrm{range}(X)|$

for any discrete random variable $X$, with equality in the former inequality if and
only if $X$ is deterministic (i.e. it only takes on one value), and equality in the latter
inequality if and only if it is uniformly distributed in $\mathrm{range}(X)$; see Lemma A.1 for
a more precise statement. In particular, a boolean random variable (i.e. one which 
takes values in {0, 1}) has entropy at most log 2 with our choice of normalisation of 
entropy. One can view G-random variables as a generalisation of the concept of a 
finite non-empty subset of G, in which the weight (or probability) assigned to each 
element in the range is not necessarily uniform. 
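
The bounds in (7) are easy to check numerically. The sketch below is our own illustration; it assumes a distribution is stored as a Python dictionary mapping outcomes to probabilities, and uses natural logarithms.

```python
import math


def entropy(p):
    """Shannon entropy: sum of t * log(1/t) over the non-zero probabilities t."""
    return sum(t * math.log(1.0 / t) for t in p.values() if t > 0)


deterministic = {"a": 1.0}               # takes a single value
uniform = {x: 1 / 8 for x in range(8)}   # uniform on 8 points
biased_bit = {0: 0.9, 1: 0.1}            # a boolean random variable

print(entropy(deterministic))            # lower bound in (7) is attained
print(entropy(uniform), math.log(8))     # upper bound in (7) is attained
print(entropy(biased_bit), math.log(2))  # strictly below log 2
```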

Given two $G$-random variables $X, Y$ (not necessarily independent), their sum
$X + Y$ and difference $X - Y$ are also $G$-random variables, and
$\mathrm{range}(X \pm Y) \subset \mathrm{range}(X) \pm \mathrm{range}(Y)$. From standard entropy inequalities one has the trivial upper bound

(8)    $H(X \pm Y) \le H(X) + H(Y)$

while if $X$ and $Y$ are independent, one also has the trivial lower bound

(9)    $H(X), H(Y) \le H(X \pm Y)$;






TERENCE TAO 



see Lemma 2.1. The lower bound (9) can of course fail if the independence hypothesis
is dropped; for instance one clearly has $H(X - X) = 0$.

We can define the doubling constant $\sigma[X]$ of a $G$-random variable by the formula

    $\sigma[X] := \exp(H(X_1 + X_2) - H(X))$

where $X_1, X_2$ are independent copies of $X$, thus $\sigma[X] \ge 1$ by (9). This quantity is
related, but not identical, to the doubling constant $\sigma[A]$ of a set; indeed, from (7)
we see that

(10)    $\sigma[X] \le \sigma[A]$

whenever $X$ is uniformly distributed on a finite non-empty subset $A$ of $G$. However,
the doubling constant of a random variable can be significantly smaller than that
of its range. For instance, let $A$ be an interval $[0, N)$ together with $\sqrt{N}$ (say) other
integers in general position, where $N$ is large. Then the doubling constant of $A$ is
about $\sqrt{N}$, but the uniform distribution on $A$ has doubling constant $O(1)$. Thus
we see that a small amount of "noise" (such as the $\sqrt{N}$ integers in general position)
can significantly increase the doubling constant of a set, but have only a negligible
impact on the doubling constant of a random variable. Heuristically, one can thus
think of entropy sumset theory as a "noise-tolerant" analogue of combinatorial
sumset theory.
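
The noise-tolerance phenomenon can be observed numerically. The following Python sketch is our own illustration: it takes $N = 400$ and, as an assumed stand-in for "integers in general position", the $20$ integers $4000 \cdot 4^i$ for $1 \le i \le 20$. It then compares the doubling constant of the set with the entropy doubling constant of the uniform distribution on it.

```python
import math
from collections import defaultdict


def entropy(p):
    """Shannon entropy (natural logarithm) of a dict outcome -> probability."""
    return sum(t * math.log(1.0 / t) for t in p.values() if t > 0)


def uniform(A):
    """Uniform distribution on a finite set A."""
    A = list(A)
    return {a: 1.0 / len(A) for a in A}


def sum_distribution(p, q):
    """Distribution of X + Y for independent X ~ p and Y ~ q."""
    out = defaultdict(float)
    for a, pa in p.items():
        for b, qb in q.items():
            out[a + b] += pa * qb
    return dict(out)


def entropy_doubling(p):
    """exp(H(X_1 + X_2) - H(X)) for independent copies X_1, X_2 of X ~ p."""
    return math.exp(entropy(sum_distribution(p, p)) - entropy(p))


N = 400
interval = set(range(N))
noise = {4000 * 4 ** i for i in range(1, 21)}  # 20 = sqrt(N) widely spread integers
A = interval | noise

set_doubling = len({a + b for a in A for b in A}) / len(A)
rv_doubling = entropy_doubling(uniform(A))

print(set_doubling)  # of the order of sqrt(N)
print(rv_doubling)   # bounded, consistent with the O(1) claim
```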

The analogue of a partial sumset $A \stackrel{E}{+} B$ here is the concept of a sum $X + Y$ of
non-independent random variables $X, Y$. For instance, if $E \subset A \times B$ is a non-empty
set, and $(X, Y)$ is the random variable chosen uniformly at random from $E$, then
$X + Y$ is a random variable ranging in $A \stackrel{E}{+} B$.

There are several ways to define the distance between two $G$-random variables
$X, Y$ (or their associated distributions $p_X, p_Y$). For instance, we can define their
total variation distance

(11)    $\mathrm{dist}_{TV}(X, Y) = \mathrm{dist}_{TV}(p_X, p_Y) := \sum_{x \in G} |p_X(x) - p_Y(x)|$;

this is clearly a metric on $\mathrm{Pr}_c(G)$. Another useful distance is the Ruzsa distance

(12)    $\mathrm{dist}_R(X, Y) = \mathrm{dist}_R(p_X, p_Y) := H(X' - Y') - \frac{1}{2} H(X') - \frac{1}{2} H(Y')$

where $X', Y'$ are independent copies of $X, Y$ respectively; this is not quite a metric
(in particular, $\mathrm{dist}_R(X, X) > 0$ in general), but does obey the triangle inequality
as we will see in Theorem 1.10 below. A third distance of importance to us is the
following transport distance:

Definition 1.7 (Transport metric). Let $G$ be an additive group, and let $X, Y$ be
$G$-random variables. We define the entropy transport distance $\mathrm{dist}_{tr}(X, Y)$ from $X$
to $Y$ to be the infimum of $H(Z)$, where $Z$ ranges over all $G$-random variables (not
necessarily independent of $X$) such that $X + Z \equiv Y$.

Observe that $\mathrm{dist}_{tr}(X, Y) = 0$ if and only if $Y$ has the distribution of a translate
$X + c$ of $X$. Up to this equivalence, it is easy to see that the transport distance is
indeed a metric. The notion of transport metric depends only on the distribution,
so by abuse of notation we may define $\mathrm{dist}_{tr}(p_X, p_Y) := \mathrm{dist}_{tr}(X, Y)$. The notion
of two random variables being close in transport metric is roughly analogous to the
notion of (mutual) $K$-control of one set by another, introduced in [11].

Example 1.8. Let $N$ be a large even integer, let $X$ be the uniform distribution on
$[0, N)$, and let $Y$ be the uniform distribution on the even numbers in $[0, N)$. Then
the total variation distance $\mathrm{dist}_{TV}(X, Y)$ is quite large (comparable to its maximal
value of 1). On the other hand, the Ruzsa distance is quite small (of size $O(1)$).
The transport distance is also of size $O(1)$; indeed, one can transport $Y$ to $X$ by
adding a uniform boolean variable $Z \in \{0, 1\}$ which is independent of $Y$; conversely,
one can transport $X$ to $Y$ by subtracting off the parity bit $Z$ of $X$ (which is not
independent of $X$). In fact, the uniform distribution on any dense subset of $[0, N)$
lies within $O(1)$ of $X$ in the transport distance, although this is not as obvious to
see; see Corollary 4.6 below.
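
Example 1.8 can be reproduced numerically. The sketch below is our own illustration with $N = 1000$. It assumes that (11) is the unnormalised sum of absolute differences, and it verifies the transport claim by checking that adding an independent uniform bit $Z$ to $Y$ yields exactly the distribution of $X$. All function names are hypothetical.

```python
import math
from collections import defaultdict


def entropy(p):
    """Shannon entropy (natural logarithm) of a dict outcome -> probability."""
    return sum(t * math.log(1.0 / t) for t in p.values() if t > 0)


def uniform(A):
    """Uniform distribution on a finite collection A."""
    A = list(A)
    return {a: 1.0 / len(A) for a in A}


def combine(p, q, sign=1):
    """Distribution of X + sign * Y for independent X ~ p and Y ~ q."""
    out = defaultdict(float)
    for a, pa in p.items():
        for b, qb in q.items():
            out[a + sign * b] += pa * qb
    return dict(out)


def tv_distance(p, q):
    """Sum over x of |p(x) - q(x)| (assumed normalisation of (11))."""
    return sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in set(p) | set(q))


def ruzsa_distance(p, q):
    """H(X' - Y') - H(X')/2 - H(Y')/2 for independent X' ~ p, Y' ~ q."""
    return entropy(combine(p, q, sign=-1)) - entropy(p) / 2 - entropy(q) / 2


N = 1000
X = uniform(range(N))        # uniform on [0, N)
Y = uniform(range(0, N, 2))  # uniform on the even numbers in [0, N)
Z = uniform([0, 1])          # independent uniform bit

tv = tv_distance(X, Y)
ruzsa = ruzsa_distance(X, Y)
transported = combine(Y, Z)  # distribution of Y + Z

print(tv)                    # large
print(ruzsa)                 # bounded
print(entropy(Z))            # cost log 2 of transporting Y to X
```

Since `transported` coincides with the distribution of $X$, this exhibits a transport of cost $H(Z) = \log 2$, in line with the $O(1)$ claim of the example.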

The Ruzsa distance, doubling constant, and transport distance interact well with 
each other. For instance, we have the identity 

(13)    $\sigma[X] = \exp(\mathrm{dist}_R(X, -X))$

and the Lipschitz type properties

(14)    $|\mathrm{dist}_R(X', Y') - \mathrm{dist}_R(X, Y)| \le \frac{3}{2} (\mathrm{dist}_{tr}(X, X') + \mathrm{dist}_{tr}(Y, Y'))$

for any $G$-random variables $X, Y, X', Y'$, as can be seen by several applications of
(8). In particular

(15)    $|\log \sigma[X] - \log \sigma[X']| \le 3 \, \mathrm{dist}_{tr}(X, X')$.
Thus we see that random variables which are close in transport distance are essen- 
tially equivalent from the perspective of their sumset theory. 

1.9. Main results. We can now state our main results. We begin with some
sumset estimates, analogous to (2), (3), (4):

Theorem 1.10 (Entropy sumset estimates). Let $G$ be an additive group, and let
$X, Y, Z$ be $G$-random variables.

• (Ruzsa triangle inequality) We have

(16)    $\mathrm{dist}_R(X, Z) \le \mathrm{dist}_R(X, Y) + \mathrm{dist}_R(Y, Z)$.

• (Sum-difference inequality) One has

(17)    $\mathrm{dist}_R(X, -Y) \le 3 \, \mathrm{dist}_R(X, Y)$.

• (Weak Plünnecke-Ruzsa inequality) If $X_1, \ldots, X_n, X'_1, \ldots, X'_m$ are independent
copies of $X$ for some integers $n, m > 0$, then

(18)    $H(X_1 + \ldots + X_n - X'_1 - \ldots - X'_m) \le H(X) + O((n + m) \log \sigma[X])$.

We prove these estimates in Section 2. The estimate (18) loses an absolute
constant in comparison to (the logarithm of) (4). This is because we do not know
how to adapt the graph-theoretic Plünnecke inequality [8] to the entropy setting,
and so must rely instead on some weaker but less deep arguments in [11] to establish
results analogous to (4) instead.
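
The first two estimates of the theorem lend themselves to a numerical sanity check. The following Python sketch is our own illustration: it draws random finitely supported distributions on the integers (the helper `random_distribution` and its parameters are assumptions made for the experiment) and records the largest violation of the Ruzsa triangle inequality and of the sum-difference inequality. Up to rounding error, no violation should occur.

```python
import math
import random
from collections import defaultdict


def entropy(p):
    """Shannon entropy (natural logarithm) of a dict outcome -> probability."""
    return sum(t * math.log(1.0 / t) for t in p.values() if t > 0)


def difference_distribution(p, q):
    """Distribution of X - Y for independent X ~ p and Y ~ q."""
    out = defaultdict(float)
    for a, pa in p.items():
        for b, qb in q.items():
            out[a - b] += pa * qb
    return dict(out)


def ruzsa_distance(p, q):
    """H(X' - Y') - H(X')/2 - H(Y')/2 for independent X' ~ p, Y' ~ q."""
    return entropy(difference_distribution(p, q)) - entropy(p) / 2 - entropy(q) / 2


def negate(p):
    """Distribution of -X when X ~ p."""
    return {-x: t for x, t in p.items()}


def random_distribution(rng, support_size=6, spread=20):
    """A random distribution on `support_size` points of [0, spread)."""
    points = rng.sample(range(spread), support_size)
    weights = [rng.random() + 1e-3 for _ in points]  # strictly positive weights
    total = sum(weights)
    return {x: w / total for x, w in zip(points, weights)}


rng = random.Random(0)
worst_triangle = -math.inf  # largest value of dist(X,Z) - dist(X,Y) - dist(Y,Z)
worst_sumdiff = -math.inf   # largest value of dist(X,-Y) - 3 dist(X,Y)
for _ in range(200):
    X, Y, Z = (random_distribution(rng) for _ in range(3))
    d_xy = ruzsa_distance(X, Y)
    worst_triangle = max(worst_triangle, ruzsa_distance(X, Z) - d_xy - ruzsa_distance(Y, Z))
    worst_sumdiff = max(worst_sumdiff, ruzsa_distance(X, negate(Y)) - 3 * d_xy)

print(worst_triangle <= 1e-9, worst_sumdiff <= 1e-9)
```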

The analogue of the Balog-Szemerédi theorem is a little more technical to state,
requiring the concept of conditional entropy and conditionally independent trials,
and will be deferred to Section 3.

We turn now to an inverse theorem for entropy in the spirit of Theorem 1.4.

Theorem 1.11 (Inverse sumset theorem). Let $G$ be an additive group, and let $X$
be a $G$-random variable.

(i) $\sigma[X] = 1$ if and only if $X$ is the uniform distribution on a coset of a finite
subgroup of $G$.

(ii) If $\sigma[X] \le K$, then there exists a coset progression $H + P$ of rank $O_K(1)$
such that $\mathrm{dist}_{tr}(X, U) \ll_K 1$, where $U$ is the uniform distribution on $H + P$.

(iii) If $\mathrm{dist}_R(X, Y) \le K$, then $\mathrm{dist}_{tr}(X, Y) \ll_K 1$ and $\sigma[X] \ll_K 1$.

Note that the uniform distribution on a coset progression $H + P$ of rank $d$ has
doubling constant at most $2^d$, by (10), so from (15) we obtain a partial converse to
(ii): if $\mathrm{dist}_{tr}(X, U) \le K$ where $U$ is the uniform distribution on a coset progression
of rank at most $K$, then $\sigma[X] \ll_K 1$. Similarly, (14) gives a partial converse to (iii).
Thus, up to constants, Theorem 1.11 gives a satisfactory description of random
variables with small doubling constant or Ruzsa distance.

The implied constants in Theorem 1.11 can be explicitly computed from the
proof, but are rather poor (being triple exponential in $K$). We will not attempt to
optimise these constants here.

We prove Theorem 1.11 in Section 5.
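
Part (i) of the theorem can be illustrated in a finite cyclic group. The sketch below is our own example in $G = \mathbf{Z}/12\mathbf{Z}$: the uniform distribution on the coset $1 + \{0, 3, 6, 9\}$ has entropy doubling constant one, whereas the uniform distribution on $\{0, 1, 2, 3\}$, which is not a coset of a subgroup, does not.

```python
import math
from collections import defaultdict


def entropy(p):
    """Shannon entropy (natural logarithm) of a dict outcome -> probability."""
    return sum(t * math.log(1.0 / t) for t in p.values() if t > 0)


def uniform(A):
    """Uniform distribution on a finite collection A."""
    A = list(A)
    return {a: 1.0 / len(A) for a in A}


def doubling_mod(p, n):
    """exp(H(X_1 + X_2) - H(X)) for independent copies of X ~ p in Z/nZ."""
    total = defaultdict(float)
    for a, pa in p.items():
        for b, pb in p.items():
            total[(a + b) % n] += pa * pb
    return math.exp(entropy(total) - entropy(p))


n = 12
sigma_coset = doubling_mod(uniform([1, 4, 7, 10]), n)  # coset of the subgroup {0, 3, 6, 9}
sigma_other = doubling_mod(uniform([0, 1, 2, 3]), n)   # an interval, not a coset

print(sigma_coset)  # equal to one up to rounding
print(sigma_other)  # strictly larger than one
```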

1.12. The torsion-free case. When $G$ is a torsion-free group (thus $nx \ne 0$ for all
$x \ne 0$ in $G$ and all integers $n > 0$), then the trivial doubling estimate $\sigma[A] \ge 1$ can
be improved. Indeed, one has

    $|A + A| \ge 2|A| - 1$

for any finite non-empty subset $A$ of a torsion-free group $G$, since $A$ can be mapped
onto the integers by a Freiman isomorphism (see [12, Lemma 5.25]). In other words,
one has

(19)    $\sigma[A] \ge 2 - \frac{1}{|A|}$.

The example of an arithmetic progression (e.g. $A = [0, n)$) shows that this estimate
is sharp.

One can ask whether the same statement holds for entropy. The following example
shows that this is not quite the case. Let $n$ be a large integer, and let $X_n$ be
the sum of $n$ independent Bernoulli variables $\epsilon_1, \ldots, \epsilon_n \in \{-1, +1\}$ with an equal
probability of each. From the central limit theorem (or Stirling's formula), we know
that $X_n$ is approximately distributed like a gaussian of mean zero and variance $n$,
thus

    $P(X_n = x) \approx \frac{1}{\sqrt{2\pi n}} e^{-x^2/2n}$.

Approximating the Riemann sum by an integral, we then expect

    $H(X_n) \approx \int_{\mathbf{R}} F\left( \frac{1}{\sqrt{2\pi n}} e^{-x^2/2n} \right) dx = \log \sqrt{2\pi n} + \frac{1}{2}$.

It is not hard to make this heuristic precise, and obtain the asymptotic

    $H(X_n) = \log \sqrt{2\pi n} + \frac{1}{2} + o(1)$.

In particular, since $X_n + X'_n \equiv X_{2n}$ if $X'_n$ is an independent copy of $X_n$, we see
that

    $\sigma[X_n] = \sqrt{2} - o(1)$,

which is less than what one might have predicted from (19), (10). The point is
that in the entropy setting one can construct "approximate gaussian" counterexamples
whose closest analogue in the combinatorial setting, namely the arithmetic
progressions, are less efficient by a constant factor.
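
The limiting value $\sqrt{2}$ can be observed directly from exact binomial probabilities. The following Python sketch is our own illustration; it uses the fact that $X_n = 2B - n$ with $B$ binomial with parameters $n$ and $1/2$, so that $H(X_n) = H(B)$, and it only examines the difference $H(X_{2n}) - H(X_n)$ that enters the doubling constant.

```python
import math


def entropy_of_sign_sum(n):
    """Entropy of X_n = eps_1 + ... + eps_n with eps_i uniform in {-1, +1}.

    X_n = 2 * B - n with B ~ Binomial(n, 1/2), so H(X_n) = H(B).
    """
    probabilities = (math.comb(n, j) / 2 ** n for j in range(n + 1))
    return sum(p * math.log(1.0 / p) for p in probabilities if p > 0)


def doubling(n):
    """exp(H(X_{2n}) - H(X_n)), using that X_n + X'_n has the law of X_{2n}."""
    return math.exp(entropy_of_sign_sum(2 * n) - entropy_of_sign_sum(n))


for n in (5, 50, 200):
    print(n, doubling(n))  # approaches sqrt(2) as n grows

print(math.sqrt(2))
```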

It should not be surprising to experts in information theory that this gaussian- 
type bound is best possible: 

Theorem 1.13. If $\epsilon > 0$, $G$ is torsion-free, and $X$ is a $G$-random variable, then

    $\sigma[X] \ge \sqrt{2} - \epsilon$,

provided $H(X)$ is sufficiently large depending on $\epsilon$.

In asymptotic notation, Theorem 1.13 asserts that

    $\sigma[X] \ge \sqrt{2} - o_{H(X) \to \infty}(1)$.

We prove Theorem 1.13 in Section 6 below. This result combines the inverse
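theorem in Theorem 1.11 with an analogous inequality concerning the Shannon
entropy

    $H_{\mathbf{R}}(X) := \int_{\mathbf{R}} F(p_X(x)) \, dx$

of a continuous random variable $X$ taking values in $\mathbf{R}$, namely

(20)    $H_{\mathbf{R}}(S + T) \ge \frac{1}{2} (H_{\mathbf{R}}(S) + H_{\mathbf{R}}(T)) + \frac{1}{2} \log 2$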

for independent continuous random variables S, T (see [TJ Theorem 2]). The inverse 
sumset theory is necessary in order to approximate the discrete random variable by 
a continuous one in a certain sense. 

In [1], the continuous entropy inequality

    $H_{\mathbf{R}}(X_1 + \ldots + X_{n+1}) \ge H_{\mathbf{R}}(X_1 + \ldots + X_n) + \log \frac{\sqrt{n+1}}{\sqrt{n}}$

was established, where $X_1, \ldots, X_{n+1}$ were independent copies of the same continuous
random variable. In view of Theorem 1.13 it is thus natural to conjecture
that

(21)    $H(X_1 + \ldots + X_{n+1}) \ge H(X_1 + \ldots + X_n) + \log \frac{\sqrt{n+1}}{\sqrt{n}} - \epsilon$

for any $\epsilon > 0$ and any $G$-random variable $X$, if $G$ is torsion-free and $H(X)$ is
sufficiently large depending on $n, \epsilon$. Unfortunately we were not able to establish
this because the inverse theorem is not applicable in this setting; nevertheless we
believe (21) to be true.

Finally, we remark that a number of additional entropy sumset inequalities were
recently established in the literature. For instance, it was shown that

    $H(X + Y + Z) \le \frac{1}{2} (H(X + Y) + H(Y + Z) + H(Z + X))$

for independent G-random variables X, Y, Z, which is an entropy analogue of the 
inequality 

    $|A + B + C| \le |A + B|^{1/2} |B + C|^{1/2} |C + A|^{1/2}$

(see e.g. [6] or [3] for a proof). However, these bounds are primarily of interest in 
the regime where the doubling constants of the sets involved are large, and so are 
not directly related to the ones presented here.

2. Sumset estimates

In this section we establish the various sumset estimates claimed in the introduction,
and in particular establish Theorem 1.10. The main tools will be entropy
inequalities (in particular the submodularity inequality, Lemma A.2), elementary
arithmetic identities, and independent and conditionally independent trials.

Readers who are familiar with the combinatorial analogues of these inequalities 
are invited to "pretend" that all of the random variables below are uniformly dis- 
tributed on various finite sets, and in particular on finite groups, in order to see the 
analogy between both the statements and the proofs of the combinatorial and the 
entropy estimates. Indeed, the arguments here were discovered by the reverse of this 
procedure, in which the author searched for the nearest entropy-theoretic analogue 
to each step in the combinatorial arguments. For instance, if the combinatorial 
argument required one to pick an object a from a finite set A, the entropy-based 
argument would instead consider an analogous random variable that was naturally 
associated to A; if the combinatorial argument required two objects to be related 
in some way, this usually manifested itself as a coupling of random variables (e.g. 
by the use of conditionally independent trials); and so forth. 

We begin with the trivial sumset estimates.

Lemma 2.1 (Trivial sumset estimate). If $X, Y$ are two $G$-random variables, and
$Z$ is a discrete random variable, then

    $H(X + Y|Z) \le H(X|Z) + H(Y|Z)$.

If furthermore $X, Y$ are conditionally independent relative to $Z$, then

    $\max(H(X|Z), H(Y|Z)) \le H(X + Y|Z)$.

In particular we have the inequalities (8), (9), and $\mathrm{dist}_R(X, Y) \ge 0$ for all
$G$-random variables $X, Y$.

Proof. By conditioning on $Z$ we may assume that $Z$ is deterministic, thus the task
reduces to showing (8) and (9). The former inequality follows from the basic entropy
inequalities of Appendix A, since $(X, Y)$ determines $X + Y$. To prove the latter
inequality, observe from the same inequalities and the independence of $X, Y$ that

    $H(X + Y) \ge H(X + Y|Y) = H(X|Y) = H(X)$

and similarly $H(X + Y) \ge H(Y)$, and the claim follows. □

Now we establish the Ruzsa triangle inequality (16), which we rewrite as

    $H(X - Z) \le H(X - Y) + H(Y - Z) - H(Y)$

where $X, Y, Z$ are independent $G$-random variables. Observe that $(X - Y, Y - Z)$
and $(X, Z)$ both determine $X - Z$, while $(X - Y, Y - Z)$ and $(X, Z)$ jointly determine
$(X, Y, Z)$. By the submodularity inequality (Lemma A.2) we conclude that

    $H(X, Y, Z) + H(X - Z) \le H(X - Y, Y - Z) + H(X, Z)$.

Applying the subadditivity of entropy (see Appendix A) and the independence
hypotheses we obtain the claim.
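
The submodularity step in this argument can be checked numerically. The following Python sketch is our own illustration: it builds the joint distribution of three independent integer-valued random variables (drawn by a hypothetical helper `random_distribution`), computes the four entropies in the submodularity inequality by pushing the joint distribution forward, and then evaluates the slack in the rewritten triangle inequality.

```python
import math
import random
from collections import defaultdict
from itertools import product


def entropy(p):
    """Shannon entropy (natural logarithm) of a dict outcome -> probability."""
    return sum(t * math.log(1.0 / t) for t in p.values() if t > 0)


def pushforward(joint, f):
    """Distribution of f(W) when W has distribution `joint`."""
    out = defaultdict(float)
    for w, pw in joint.items():
        out[f(w)] += pw
    return dict(out)


def random_distribution(rng, support_size=5, spread=15):
    """A random distribution on `support_size` points of [0, spread)."""
    points = rng.sample(range(spread), support_size)
    weights = [rng.random() + 1e-3 for _ in points]
    total = sum(weights)
    return {x: w / total for x, w in zip(points, weights)}


rng = random.Random(1)
X, Y, Z = (random_distribution(rng) for _ in range(3))

# Joint distribution of the independent triple (X, Y, Z).
joint = {
    (x, y, z): px * py * pz
    for (x, px), (y, py), (z, pz) in product(X.items(), Y.items(), Z.items())
}


def H(f):
    """Entropy of the random variable f(X, Y, Z)."""
    return entropy(pushforward(joint, f))


# Submodularity: H(X,Y,Z) + H(X-Z) <= H(X-Y, Y-Z) + H(X,Z).
lhs = H(lambda w: w) + H(lambda w: w[0] - w[2])
rhs = H(lambda w: (w[0] - w[1], w[1] - w[2])) + H(lambda w: (w[0], w[2]))

# Slack in H(X-Z) <= H(X-Y) + H(Y-Z) - H(Y).
triangle_gap = (
    H(lambda w: w[0] - w[1]) + H(lambda w: w[1] - w[2])
    - entropy(Y) - H(lambda w: w[0] - w[2])
)

print(lhs <= rhs + 1e-9, triangle_gap >= -1e-9)
```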

To prove (17), we introduce the idea of conditionally independent trials. Given
two random variables $X, Y$ (not necessarily independent), we can produce two
conditionally independent trials $X_1, X_2$ of $X$ relative to $Y$, defined by declaring
$(X_1|Y = y)$ and $(X_2|Y = y)$ to be independent trials of $(X|Y = y)$ for all
$y \in \mathrm{range}(Y)$, thus in particular $X_1 \equiv X_2 \equiv X$, and $X_1, X_2$ are conditionally
independent relative to $Y$. Observe from conditional independence that

    $H(X_1, X_2|Y) = H(X_1|Y) + H(X_2|Y) = 2H(X|Y)$

and thus

(22)    $H(X_1, X_2, Y) = 2H(X, Y) - H(Y)$.
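
The construction of conditionally independent trials is explicit enough to implement. The sketch below is our own illustration: starting from an assumed toy joint distribution of a dependent pair $(X, Y)$, it builds the joint distribution of $(X_1, X_2, Y)$ via $p(x_1, x_2, y) = p(x_1, y) p(x_2, y) / p_Y(y)$ and checks the identity (22).

```python
import math
from collections import defaultdict


def entropy(p):
    """Shannon entropy (natural logarithm) of a dict outcome -> probability."""
    return sum(t * math.log(1.0 / t) for t in p.values() if t > 0)


def conditionally_independent_trials(joint_xy):
    """Return the law of (X_1, X_2, Y) and the marginal law of Y.

    Given Y = y, the variables X_1 and X_2 are independent copies of (X | Y = y).
    """
    p_y = defaultdict(float)
    for (_, y), p in joint_xy.items():
        p_y[y] += p
    trials = {}
    for (x1, y1), p1 in joint_xy.items():
        for (x2, y2), p2 in joint_xy.items():
            if y1 == y2:
                trials[(x1, x2, y1)] = p1 * p2 / p_y[y1]
    return trials, dict(p_y)


# A toy joint distribution of a dependent pair (X, Y); keys are (x, y).
joint_xy = {(0, 0): 0.4, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.3}

trials, p_y = conditionally_independent_trials(joint_xy)

lhs = entropy(trials)                         # H(X_1, X_2, Y)
rhs = 2 * entropy(joint_xy) - entropy(p_y)    # 2 H(X, Y) - H(Y)
print(lhs, rhs)
```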

Let $X, Y$ be independent $G$-random variables. Let $(X_1, Y_1)$, $(X_2, Y_2)$ be conditionally
independent trials of $(X, Y)$ relative to $X - Y$; since $(X, Y)$ determines
$X - Y$, we conclude that $X_1 - Y_1 = X_2 - Y_2$. Let $(X_3, Y_3)$ be another trial of
$(X, Y)$, independent of $X_1, X_2, Y_1, Y_2$, then we have the identity

    $X_3 + Y_3 = (X_3 - Y_2) - (X_1 - Y_3) + X_2 + Y_1$.

Thus $(X_3 - Y_2, X_1 - Y_3, X_2, Y_1)$ and $(X_3, Y_3)$ each determine $X_3 + Y_3$, while
$(X_3 - Y_2, X_1 - Y_3, X_2, Y_1)$ and $(X_3, Y_3)$ together determine $(X_1, X_2, X_3, Y_1, Y_2, Y_3)$;
applying the submodularity inequality (Lemma A.2) we conclude

    $H(X_1, X_2, X_3, Y_1, Y_2, Y_3) + H(X_3 + Y_3) \le H(X_3 - Y_2, X_1 - Y_3, X_2, Y_1) + H(X_3, Y_3)$.

But from (22), the subadditivity of entropy, and the independence hypotheses we have

    $H(X_1, X_2, X_3, Y_1, Y_2, Y_3) = 2H(X, Y) - H(X - Y) + H(X) + H(Y)$

    $H(X_3 + Y_3) = H(X + Y)$

    $H(X_3 - Y_2, X_1 - Y_3, X_2, Y_1) \le 2H(X - Y) + H(X) + H(Y)$

    $H(X_3, Y_3) = H(X) + H(Y)$

and thus

(23)    $H(X + Y) \le 3H(X - Y) - H(X) - H(Y)$

which rearranges to form (17).

Finally, we establish (18). Let $X, Y$ be independent $G$-random variables, and
let $(X_0, Y_0), \ldots, (X_n, Y_n)$ be independent trials of $(X, Y)$. Set $S_i := X_i + Y_i$ for
$0 \le i \le n$. We observe the identity

    $S_0 + \ldots + S_n = (Y_0 + X_1) + (Y_1 + X_2) + \ldots + (Y_{n-1} + X_n) + (Y_n + X_0)$.

In particular, we see that $(X_0, Y_0, S_1, \ldots, S_n)$ and $(Y_0 + X_1, \ldots, Y_{n-1} + X_n, Y_n + X_0)$
both determine $S_0 + \ldots + S_n$, while $(X_0, Y_0, S_1, \ldots, S_n)$ and
$(Y_0 + X_1, \ldots, Y_{n-1} + X_n, Y_n + X_0)$ jointly determine $(X_0, \ldots, X_n, Y_0, \ldots, Y_n)$. Applying the
submodularity inequality (Lemma A.2) we conclude

    $H(X_0, \ldots, X_n, Y_0, \ldots, Y_n) + H(S_0 + \ldots + S_n) \le H(X_0, Y_0, S_1, \ldots, S_n) + H(Y_0 + X_1, \ldots, Y_{n-1} + X_n, Y_n + X_0)$.

But from the subadditivity of entropy and the independence hypotheses we have

    $H(X_0, \ldots, X_n, Y_0, \ldots, Y_n) = (n + 1)(H(X) + H(Y))$

    $H(X_0, Y_0, S_1, \ldots, S_n) = H(X) + H(Y) + nH(X + Y)$

    $H(Y_0 + X_1, \ldots, Y_{n-1} + X_n, Y_n + X_0) \le (n + 1) H(X + Y)$;

Putting all this together we obtain the inequality

    $H(S_0 + \ldots + S_n) \le (2n + 1) H(X + Y) - nH(X) - nH(Y)$.

In particular, if $X_1, \ldots, X_{2n+2}$ are independent copies of $X$ then the above inequality
(setting $Y$ to be another independent copy of $X$) gives

    $H(X_1 + \ldots + X_{2n+2}) \le H(X) + (2n + 1) \log \sigma[X]$;

applying (9) one concludes that

    $H(X_1 + \ldots + X_n) = H(X) + O(n \log \sigma[X])$

for any $n \ge 1$. Applying (23) one then concludes that

    $H(X_1 + \ldots + X_n - X'_1 - \ldots - X'_m) = H(X) + O((n + m) \log \sigma[X])$

for any $n, m \ge 1$, and the claim follows. The proof of Theorem 1.10 is now
complete.

3. An entropy version of the Balog-Szemeredi-Gowers lemma 

We now state an entropy analogue of the Balog-Szemeredi-Gowers lemma. In 
the combinatorial setting, one had the notion of a refinement A' of a set A, which 
was a subset A' of A which still had size comparable to A. In the entropy setting, 
the corresponding notion is that of a conditioning of a random variable X relative 
to some other related random variable $Y$, such that $H(X|Y)$ was still close to
$H(X)$. The entropy Balog-Szemeredi-Gowers lemma then asserts that if two weakly
dependent random variables X, Y have a sum of small entropy, then there exist 
conditionings of X, Y (which capture most of the entropy) whose independent sum 
still has small entropy. 

In fact, the conditioning can be given explicitly: 

Theorem 3.1 (Entropy Balog-Szemerédi-Gowers lemma). Let $G$ be an additive
group, and let $X, Y$ be $G$-random variables which are weakly dependent in the sense
that

(24)    $H(X, Y) \ge H(X) + H(Y) - \log K$

for some $K \ge 1$. Suppose also that

(25)    $H(X + Y) \le \frac{1}{2} H(X) + \frac{1}{2} H(Y) + \log K$.

Then if we let $(X_1, Y)$, $(X_2, Y)$ be conditionally independent trials of $(X, Y)$
conditioning on $Y$, and then let $(X_1, X_2, Y)$ and $(X_1, Y')$ be conditionally independent
trials of $(X_1, X_2, Y)$ and $(X_1, Y)$ conditioning on $X_1$, then $X_2$ and $Y'$ are
conditionally independent relative to $X_1, Y$, with

(26)    $H(X_2|X_1, Y) \ge H(X) - \log K$

(27)    $H(Y'|X_1, Y) \ge H(Y) - \log K$

(28)    $H(X_2 + Y'|X_1, Y) \le \frac{1}{2} H(X) + \frac{1}{2} H(Y) + 7 \log K$.

This should be compared with Lemma 1.2. The appearance of the exponent 7
in both statements is not coincidental, as the proofs are fundamentally the same.

Remark 3.2. Let $E \subset A \times B$ be a regular bipartite graph between two finite
non-empty sets $A, B$ in $G$, thus the $A$-degree $|\{b \in B : (a, b) \in E\}|$ is independent of
$a \in A$, and similarly the $B$-degree $|\{a \in A : (a, b) \in E\}|$ is independent of $b \in B$.
Let $(X, Y)$ be an element of $E$ chosen uniformly at random. Then the random
variables $(Y', X_1, Y, X_2)$ defined above are drawn uniformly from the space of all
paths $(b, a, b', a')$ of length three in $E$, thus $(a, b), (a, b'), (a', b') \in E$. It may be
helpful to keep this example in mind when going through the proof of Theorem
3.1. Not surprisingly, paths of length three also play a major role in the proof of
the combinatorial Balog-Szemerédi-Gowers lemma.

We now establish the theorem. By construction, Y' and X2, Y are conditionally 
independent relative to Xi, and thus X2 and Y' are conditionally independent 
relative to Xi,Y as claimed. Also, since Xi is conditionally independent of X2 
relative to Y, one has 

    $H(X_2|X_1, Y) = H(X_2|Y) = H(X|Y)$

and (26) follows from (24). Similarly, since $Y, Y'$ are conditionally independent
relative to $X_1$, one has

    $H(Y'|X_1, Y) = H(Y'|X_1) = H(Y|X)$

and (27) follows from (24).

The only remaining claim to establish is (28). We need a preliminary lemma:

Lemma 3.3 (Weak Balog-Szemerédi-Gowers lemma). We have

    $H(X_1 - X_2|Y) \le H(X) + 4 \log K$.

Proof. Let $(X_1, X_2, Y)$, $(X_1, X_2, Y')$ be two conditionally independent copies of
$(X_1, X_2, Y)$ relative to $(X_1, X_2)$. Observe that $(X_1, X_2, Y)$, $(X_1 + Y', X_2 + Y', Y)$
both determine $(X_1 - X_2, Y)$, and that $(X_1, X_2, Y)$ and $(X_1 + Y', X_2 + Y', Y)$ jointly
determine $(X_1, X_2, Y, Y')$. Applying the submodularity inequality (Lemma A.2) we
conclude that

    $H(X_1, X_2, Y, Y') + H(X_1 - X_2, Y) \le H(X_1, X_2, Y) + H(X_1 + Y', X_2 + Y', Y)$.

But from (22) and the basic entropy inequalities of Appendix A, one has

    $H(X_1, X_2, Y, Y') = 2H(X_1, X_2, Y) - H(X_1, X_2)$
                        $\ge 4H(X, Y) - 2H(Y) - 2H(X)$
    $H(X_1 - X_2, Y) = H(X_1 - X_2|Y) + H(Y)$
    $H(X_1, X_2, Y) = 2H(X, Y) - H(Y)$
    $H(X_1 + Y', X_2 + Y', Y) \le 2H(X + Y) + H(Y)$

and thus

    $H(X_1 - X_2|Y) \le 2H(X + Y) + H(Y) + 2H(X) - 2H(X, Y)$,

and the claim then follows from (24), (25). □

Now observe that $(X_2, Y', Y)$ and $(X_1 - X_2, X_1 + Y', Y)$ both determine
$(X_2 + Y', Y)$, and that $(X_2, Y', Y)$ and $(X_1 - X_2, X_1 + Y', Y)$ jointly determine $(X_1, X_2, Y, Y')$.
Applying the submodularity inequality (Lemma A.2) we conclude that

    $H(X_1, X_2, Y, Y') + H(X_2 + Y', Y) \le H(X_2, Y', Y) + H(X_1 - X_2, X_1 + Y', Y)$.

But from the basic entropy inequalities of Appendix A and (a generalisation of) (22), one has

    $H(X_1, X_2, Y, Y') = H(X_1, X_2, Y) + H(X_1, Y') - H(X_1)$
                        $= 2H(X, Y) - H(Y) + H(X, Y) - H(X)$
    $H(X_2 + Y', Y) = H(X_2 + Y'|Y) + H(Y)$
    $H(X_2, Y', Y) \le H(X_2, Y) + H(Y')$
                   $= H(X, Y) + H(Y)$
    $H(X_1 - X_2, X_1 + Y', Y) \le H(X_1 - X_2|Y) + H(Y) + H(X_1 + Y')$
                               $\le H(X_1 - X_2|Y) + H(Y) + H(X + Y)$.

Substituting these bounds, we obtain

    $H(X_2 + Y'|Y) \le H(X_1 - X_2|Y) + 2H(Y) + H(X) + H(X + Y) - 2H(X, Y)$.

Applying Lemma 3.3, (24), (25) we conclude that

    $H(X_2 + Y'|Y) \le \frac{1}{2} H(X) + \frac{1}{2} H(Y) + 7 \log K$

and the claim (28) follows, since conditioning on the additional variable $X_1$ cannot
increase the entropy.

4. Uniformisation 

The main purpose of this section is to establish the following uniformisation
bound on groups, as well as an analogous result for coset progressions (see Corollary
4.6).

Theorem 4.1 (Uniformisation on a group). Let $G$ be a finite group, let $p_U := 1/|G|$
be the uniform distribution on $G$, and let $p \in \mathrm{Pr}_c(G)$ be another distribution, such
that

    $H(p) \ge \log |G| - \log K$

for some $K \ge 10$. Then

    $\mathrm{dist}_{tr}(p, p_U) \ll \log K$.

Since $H(p_U) = \log |G|$, we see that this is sharp up to constants. One can
view this theorem as a special case of Theorem 1.11 but with significantly better
dependence on the constants.

We establish this theorem by a sequence of partial results. We first record a 
simple lemma that allows us to "divide and conquer" the problem of estimating the 
transport distance between two random variables. 

Lemma 4.2 (Transport splitting lemma). Let $G$ be a group, let $X, Y$ be $G$-random
variables, and let $S$ be another discrete random variable; we do not assume $X, Y, S$
to be independent. Then

    $\mathrm{dist}_{tr}(X, Y) \le H(S) + \sum_{s \in \mathrm{range}(S)} p_S(s) \, \mathrm{dist}_{tr}((X|S = s), (Y|S = s))$.

Proof. Let $\epsilon > 0$. For each $s \in \mathrm{range}(S)$, we can use Definition 1.7 to select a random
variable $Z_s$ conditioned to the event $S = s$ of entropy
$H(Z_s) \le \mathrm{dist}_{tr}((X|S = s), (Y|S = s)) + \epsilon$ such that $(X + Z_s|S = s) \equiv (Y|S = s)$. If we then let $Z$ be the
random variable whose conditioning to $S = s$ equals $Z_s$, then $X + Z \equiv Y$, and (by
the chain rule for entropy, see Appendix A)

    $H(Z) \le H(S) + H(Z|S) = H(S) + \sum_{s \in \mathrm{range}(S)} p_S(s) H(Z|S = s)$.

Putting all this together, we conclude that

    $\mathrm{dist}_{tr}(X, Y) \le H(S) + \sum_{s \in \mathrm{range}(S)} p_S(s) \, \mathrm{dist}_{tr}((X|S = s), (Y|S = s)) + |\mathrm{range}(S)| \epsilon$.

Since $\epsilon$ was arbitrary, the claim follows. □

Next, we show that one can converge exponentially fast to the uniform distribution
in the $\ell^2$ sense.

Lemma 4.3 ($\ell^2$ flattening lemma). Let $G$ be a finite group, let $p_U := 1/|G|$ be the
uniform distribution on $G$, and let $p \in \mathrm{Pr}_c(G)$ be another distribution. Then for
any integer $k \ge 1$, one can find a distribution $p_k \in \mathrm{Pr}_c(G)$ such that

    $\mathrm{dist}_{tr}(p, p_k) \le k \log 2$

and

    $\|p_k - p_U\|_{\ell^2(G)} \le 2^{-k/2} \|p - p_U\|_{\ell^2(G)}$.

Proof. By induction it suffices to verify the case $k = 1$. We use the first moment
method. Let $h$ be chosen uniformly at random from $G$, and let

    $p_1(x) := \frac{1}{2} (p(x) + p(x - h))$.

Clearly $p_1$ is the convolution of $p$ with a Bernoulli variable of entropy at most $\log 2$, and
so $\mathrm{dist}_{tr}(p, p_1) \le \log 2$. On the other hand, a straightforward calculation using
$\sum_{x \in G} p(x) = 1$ reveals the identity

    $\mathbf{E}_h \|p_1 - p_U\|_{\ell^2(G)}^2 = \frac{1}{2} \|p - p_U\|_{\ell^2(G)}^2$

and the claim follows. □
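
The averaging identity at the heart of this proof can be verified numerically. The following Python sketch is our own illustration in the cyclic group $G = \mathbf{Z}/16\mathbf{Z}$ (an assumed choice of finite group): it averages the squared $\ell^2$ distance to uniform over all shifts $h$, compares it with half the original squared distance, and picks out a shift that does at least as well as the average, as the first moment method guarantees.

```python
import random


def l2_sq_distance_to_uniform(p):
    """Squared l^2 distance between a distribution on Z/nZ (a list) and uniform."""
    n = len(p)
    return sum((t - 1.0 / n) ** 2 for t in p)


def flatten(p, h):
    """The averaged distribution x -> (p(x) + p(x - h)) / 2 on Z/nZ."""
    n = len(p)
    return [(p[x] + p[(x - h) % n]) / 2 for x in range(n)]


rng = random.Random(2)
n = 16
weights = [rng.random() for _ in range(n)]
total = sum(weights)
p = [w / total for w in weights]  # a random distribution on Z/16Z

before = l2_sq_distance_to_uniform(p)
after = [l2_sq_distance_to_uniform(flatten(p, h)) for h in range(n)]
average_after = sum(after) / n    # expectation over a uniformly random shift h

best_h = min(range(n), key=after.__getitem__)

print(average_after, before / 2)  # the two sides of the identity
print(best_h, after[best_h] <= before / 2 + 1e-12)
```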
We now combine these lemmas to pass to an $\ell^2$-bounded random variable.

Lemma 4.4 (Entropy-uniform to $\ell^2$-bounded). Let $G$ be a finite group, and let
$p \in \mathrm{Pr}_c(G)$ be such that

    $H(p) \ge \log |G| - \log K$

for some $K \ge 10$. Then there exists $q \in \mathrm{Pr}_c(G)$ with

    $\mathrm{dist}_{tr}(p, q) \ll \log K$

and

    $\|q - p_U\|_{\ell^2(G)} \ll 1/|G|^{1/2}$.

Proof. The basic idea here is to flatten all the regions of $G$ in which $X$ has an
abnormally high probability density.
By Lemma A.1 we have

(29)    $\sum_{k=1}^\infty 2^k P(X \in A_k) \ll \log K$

where $A_k$ are the sets

    $A_k := \{ x \in G : 2^{2^{k-1}}/|G| < p(x) \le 2^{2^k}/|G| \}$

for $k \ge 1$. We then set $A_0 := G \setminus \bigcup_{k=1}^\infty A_k$, thus the $A_0, A_1, \ldots$ partition $G$ (and
thus only finitely many are non-empty).

Let $X$ be a random variable with distribution $p$. For each $k \ge 0$, let $E_k$ be the
event that $X \in A_k$, thus the $E_k$ partition the probability space. From (29) we have

(30)    $\sum_{k=1}^\infty 2^k P(E_k) \ll \log K$.

Suppose that $k \ge 1$ is such that $E_k$ has positive probability. Then we can define
$X_k$ to be the random variable $X_k := (X|E_k)$. Observe that $p_{X_k}$ is bounded above
by $\frac{2^{2^k}}{P(E_k)|G|}$, and so we have the crude bound

    $\|p_{X_k}\|_{\ell^2(G)} \le \left( \frac{2^{2^k}}{P(E_k)|G|} \right)^{1/2}$.

Applying Lemma 4.3 (with $k$ replaced by a large multiple of $2^k + \log \frac{1}{P(E_k)}$), one
can thus find $q_k \in \mathrm{Pr}_c(G)$ such that

(31)    $\mathrm{dist}_{tr}(p_{X_k}, q_k) \ll 2^k + \log \frac{1}{P(E_k)}$

and

(32)    $\|q_k - p_U\|_{\ell^2(G)} \le 1/|G|^{1/2}$.

(Indeed, one could even gain a factor of $2^{2^k}$ on the right-hand side of (32), though
this turns out to be unnecessary for the current argument.)
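Now set $q \in \mathrm{Pr}_c(G)$ to be the probability distribution

(33) $$q := 1_{A_0} p_X + \sum_{k=1}^\infty \mathbf{P}(E_k) q_k.$$

Observe that $p_X$ is bounded by $2/|G|$ on $A_0$. From (32), (33) and the triangle inequality we conclude that

$$\|q - p_U\|_{\ell^2(G)} \ll 1/|G|^{1/2}.$$

From Lemma 4.2 (setting $S$ to be the random variable induced by the partition $E_k$), we see that

(34) $$\mathrm{dist}_{tr}(p, q) \leq \sum_{k=0}^\infty \mathbf{P}(E_k) \log \frac{1}{\mathbf{P}(E_k)} + \sum_{k=1}^\infty \mathbf{P}(E_k)\, \mathrm{dist}_{tr}(p_{X_k}, q_k).$$

From (31), (34), (30) we conclude that

$$\mathrm{dist}_{tr}(p, q) \ll \log K + \sum_{k=0}^\infty \mathbf{P}(E_k) \log \frac{1}{\mathbf{P}(E_k)}.$$

But from (30), $\mathbf{P}(E_k) \ll 2^{-k} \log K$, and so

$$\mathbf{P}(E_k) \log \frac{1}{\mathbf{P}(E_k)} \ll (1 + k) 2^{-k} \log K,$$

and thus

$$\sum_{k=0}^\infty \mathbf{P}(E_k) \log \frac{1}{\mathbf{P}(E_k)} \ll \log K$$

and the claim follows. □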

From the triangle inequality, it is now clear that Theorem 4.1 follows from Lemma 4.4, Lemma 4.3 and

Lemma 4.5 ($\ell^2$-bounded to uniform). Let $G$ be a finite group, and let $p \in \mathrm{Pr}_c(G)$ be such that $\|p - p_U\|_{\ell^2(G)} \leq 1/|G|^{1/2}$. Then $\mathrm{dist}_{tr}(p, p_U) \ll 1$.

Proof. The idea is to manually transport away the most severe irregularities in the distribution of $p$ to obtain a new distribution that is significantly closer to uniform in the $\ell^2$ norm, and then iterate.

Let $C_G$ be the supremum of $\mathrm{dist}_{tr}(p, p_U)$ for all $p \in \mathrm{Pr}_c(G)$ with $\|p - p_U\|_{\ell^2(G)} \leq 1/|G|^{1/2}$. It is easy to see that $C_G$ is finite for any fixed finite $G$; our task is to show that $C_G \ll 1$ (uniformly in $G$).

Let $k \geq 1$ be a large integer to be chosen later. Let $p \in \mathrm{Pr}_c(G)$ be such that $\|p - p_U\|_{\ell^2(G)} \leq 1/|G|^{1/2}$; then by Lemma 4.3 one can find $q \in \mathrm{Pr}_c(G)$ with $\mathrm{dist}_{tr}(p, q) \leq k \log 2$ and

(35) $$\|q - p_U\|_{\ell^2(G)} \leq 2^{-k/2}/|G|^{1/2}.$$

If $q = p_U$, then we have $\mathrm{dist}_{tr}(p, p_U) \leq k \log 2$, so suppose instead that $q$ is not identically equal to $p_U$. Then the quantity

$$\sigma := \sum_{x \in G : q(x) > p_U(x)} (q(x) - p_U(x)) = \sum_{x \in G : q(x) < p_U(x)} (p_U(x) - q(x))$$

is non-zero; from (35) and the Cauchy-Schwarz inequality we also have

(36) $$\sigma \leq 2^{-k/2}.$$

Let $q_+, q_- \in \mathrm{Pr}_c(G)$ be the probability distributions defined by

$$q_+(x) := \frac{1}{\sigma} 1_{q(x) > p_U(x)} (q(x) - p_U(x))$$

and

$$q_-(x) := \frac{1}{\sigma} 1_{q(x) < p_U(x)} (p_U(x) - q(x)),$$

thus

$$q = p_U + \sigma q_+ - \sigma q_-.$$

We can thus build a random variable $X$ with distribution $q$ by creating a boolean random variable $S \in \{0,1\}$ with $p_S(1) = \sigma$, then setting $(X|S=1)$ to have distribution $q_+$ and $(X|S=0)$ to have distribution $\frac{1}{1-\sigma}(p_U - \sigma q_-)$. If we let $Y$ be a random variable with $(Y|S=1)$ having distribution $q_-$ and $(Y|S=0)$ having distribution $\frac{1}{1-\sigma}(p_U - \sigma q_-)$, we see that $p_Y = p_U$. From Lemma 4.2 we conclude that

$$\mathrm{dist}_{tr}(q, p_U) \leq \mathbf{H}(S) + \sigma\, \mathrm{dist}_{tr}(q_+, q_-) \ll \sigma \log \frac{1}{\sigma} + \sigma\, \mathrm{dist}_{tr}(q_+, q_-).$$
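The decomposition $q = p_U + \sigma q_+ - \sigma q_-$ and the coupling of $X$ and $Y$ through the boolean variable $S$ can be illustrated on a small example. This is a minimal sketch under the assumption that $G$ has five elements and that $q$ is the hypothetical distribution below; the helper name `split_excess` is an illustrative choice, not notation from the paper.

```python
from fractions import Fraction


def split_excess(q):
    """Return (sigma, q_plus, q_minus) for a distribution q on a finite group."""
    n = len(q)
    u = Fraction(1, n)  # value of the uniform distribution p_U
    sigma = sum(qx - u for qx in q if qx > u)
    q_plus = [(qx - u) / sigma if qx > u else Fraction(0) for qx in q]
    q_minus = [(u - qx) / sigma if qx < u else Fraction(0) for qx in q]
    return sigma, q_plus, q_minus


# Hypothetical non-uniform distribution on a group with five elements.
q = [Fraction(3, 10), Fraction(3, 10), Fraction(1, 5),
     Fraction(1, 10), Fraction(1, 10)]
u = Fraction(1, len(q))
sigma, q_plus, q_minus = split_excess(q)

# Common law of (X | S = 0) and (Y | S = 0).
base = [(u - sigma * b) / (1 - sigma) for b in q_minus]

# Laws of X and Y obtained by mixing over S, with P(S = 1) = sigma.
law_x = [sigma * a + (1 - sigma) * c for a, c in zip(q_plus, base)]
law_y = [sigma * b + (1 - sigma) * c for b, c in zip(q_minus, base)]
```

Here `law_x` recovers `q` and `law_y` recovers the uniform distribution, so $X$ and $Y$ differ only on the event $S = 1$, which has probability $\sigma$.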

Now we estimate $\mathrm{dist}_{tr}(q_+, q_-)$. From (35) and the Cauchy-Schwarz inequality we see that

$$\|q_+\|_{\ell^2(G)}, \|q_-\|_{\ell^2(G)} \leq \frac{2^{-k/2}}{\sigma |G|^{1/2}}.$$

Applying Lemma 4.3 one can find $r_+, r_- \in \mathrm{Pr}_c(G)$ with $\|r_\pm - p_U\|_{\ell^2(G)} \leq 1/|G|^{1/2}$ such that

$$\mathrm{dist}_{tr}(q_\pm, r_\pm) \ll 1 + \log \frac{2^{-k/2}}{\sigma}.$$

By definition of $C_G$, we then have

$$\mathrm{dist}_{tr}(r_\pm, p_U) \leq C_G.$$

Putting all this together using the triangle inequality, we see that

$$\mathrm{dist}_{tr}(p, p_U) \ll k + \sigma \log \frac{1}{\sigma} + \sigma \Big( C_G + \log \frac{2^{-k/2}}{\sigma} \Big).$$

Taking the worst-case value of $\sigma$ using (36) we conclude

$$\mathrm{dist}_{tr}(p, p_U) \ll k + 2^{-k/2} C_G$$

and thus on taking suprema in $p$

$$C_G \ll k + 2^{-k/2} C_G.$$

Setting $k$ sufficiently large we conclude

$$C_G \leq \tfrac{1}{2} C_G + O(1)$$

and the claim follows. □

Corollary 4.6 (Uniformisation on coset progressions). Let $H + P$ be a proper coset progression of rank $d$ in some additive group $G$, and let $p \in \mathrm{Pr}_c(H + P)$ be such that

$$\mathbf{H}(p) \geq \log|H + P| - \log K$$

for some $K \geq 10$. Let $p_U$ be the uniform distribution on $H + P$. Then

(37) $$\mathrm{dist}_{tr}(p, p_U) \ll \log K + d.$$

Proof. We can view $H + P$ as the homomorphic image of $B := H \times [0, N_1) \times \ldots \times [0, N_d)$ for some integers $N_1, \ldots, N_d \geq 1$. Let $\tilde p \in \mathrm{Pr}_c(B)$ be the pullback of $p$ to $B$, and similarly define $\tilde p_U$. We can then embed $B$ in the finite group $\tilde G := H \times \mathbf{Z}/(2N_1\mathbf{Z}) \times \ldots \times \mathbf{Z}/(2N_d\mathbf{Z})$. Observe that

$$\mathbf{H}(\tilde p), \mathbf{H}(\tilde p_U) \geq \log|\tilde G| - \log K - O(d)$$

and so by Theorem 4.1

$$\mathrm{dist}_{tr}(\tilde p, p_{\tilde G}), \mathrm{dist}_{tr}(\tilde p_U, p_{\tilde G}) \ll \log K + d$$

where $p_{\tilde G}$ is the uniform distribution on $\tilde G$, and so by the triangle inequality

$$\mathrm{dist}_{tr}(\tilde p, \tilde p_U) \ll \log K + d.$$

Observe that as $\tilde p, \tilde p_U$ both range in $B$, the shifts needed to transport $\tilde p$ to $\tilde p_U$ range in $B - B$ and so do not encounter the "wraparound" effects of the cyclic groups $\mathbf{Z}/(2N_1\mathbf{Z}), \ldots, \mathbf{Z}/(2N_d\mathbf{Z})$. Thus we can push this transport bound back to $H + P$ and establish (37) as desired. □

5. The inverse entropy theorem

We now prove Theorem 1.11. We begin with the easy claim (i). If $X$ is the uniform distribution on a coset of a finite group then $X + X$ is uniformly distributed on another coset of this group, and so $\sigma[X] = 1$ as claimed. Now suppose that $\sigma[X] = 1$, thus $\mathbf{H}(X_1 + X_2) = \mathbf{H}(X)$, where $X_1, X_2$ are independent copies of $X$. Inspecting the proof of Lemma 2.1 we conclude that $\mathbf{H}(X_1 + X_2) = \mathbf{H}(X_1 + X_2 | X_2)$, which by the discussion after (81) implies that $X_1 + X_2$ and $X_2$ are independent, or equivalently that the distribution of $(X_1 + X_2 | X_2 = x)$ is independent of $x \in \mathrm{range}(X)$. This implies that the probability distribution of $X$ is invariant under translations in $\mathrm{range}(X) - \mathrm{range}(X)$, which quickly implies that $\mathrm{range}(X) - \mathrm{range}(X)$ is a finite subgroup of $G$, and that $X$ is uniformly distributed on a coset of this subgroup, as desired.
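Claim (i) can be illustrated numerically. The sketch below assumes the group $\mathbf{Z}/12\mathbf{Z}$ and two hypothetical laws: the uniform distribution on the coset $1 + \{0, 4, 8\}$, and a non-uniform law for comparison. The helper names are illustrative and entropies are computed with natural logarithms.

```python
import math
from collections import defaultdict


def entropy(law):
    """Shannon entropy (natural logarithm) of a dict {value: probability}."""
    return sum(p * math.log(1.0 / p) for p in law.values() if p > 0)


def law_of_sum_mod(law, n):
    """Law of X_1 + X_2 in Z/nZ for independent copies X_1, X_2 of X."""
    out = defaultdict(float)
    for a, pa in law.items():
        for b, pb in law.items():
            out[(a + b) % n] += pa * pb
    return dict(out)


n = 12
uniform_on_coset = {x: 1.0 / 3.0 for x in (1, 5, 9)}  # coset 1 + {0, 4, 8}
skewed = {0: 0.5, 1: 0.3, 5: 0.2}                     # not uniform on a coset

gap_coset = entropy(law_of_sum_mod(uniform_on_coset, n)) - entropy(uniform_on_coset)
gap_skewed = entropy(law_of_sum_mod(skewed, n)) - entropy(skewed)
```

In this example `gap_coset` vanishes up to rounding, matching $\sigma[X] = 1$, while `gap_skewed` is strictly positive.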

Now we prove the more difficult claims (ii), (iii). We begin with a special case of (iii), in which $Y$ is already uniform.

Proposition 5.1. Let $X$ be a $G$-random variable, and let $H + P$ be a coset progression of rank $d$. Let $U$ be the uniform distribution on $H + P$, and suppose that $\mathrm{dist}_R(X, U) \leq \log K$. Then $\mathrm{dist}_{tr}(X, U) \ll_{K,d} 1$.

Proof. By translating $H + P$ if necessary we may assume $0 \in H + P$. We allow all implied constants to depend on $K, d$. The basic idea here is to treat $H + P$ as an approximate group, and somehow pass to a "quotient" of $G$ by $H + P$. The reader is encouraged to consider the special case $P = \{0\}$, in which this quotienting idea can be made precise.

Let $S$ be a maximal subset of $G$ with the property that the translates $s + H + P$ of $H + P$ for $s \in S$ are all disjoint. (For $G$ infinite, the existence of such an $S$ is guaranteed by Zorn's lemma.) Clearly the translates $s + (H + P) - (H + P)$ cover $G$. From this, the disjointness of the $s + H + P$, and the greedy algorithm, we can thus partition $G$ into sets $A_s$ for $s \in S$, where

$$s + (H + P) \subset A_s \subset s + (H + P) - (H + P).$$

One should view the partition $A_s$ as a crude approximation of the (non-existent) quotient of $G$ by $H + P$.

Take $U, X$ to be independent. From the hypothesis $\mathrm{dist}_R(X, U) \leq \log K$ (and the fact that $-U$ is equivalent to a translate of $U$) one has

$$\mathbf{H}(X + U) \leq \tfrac{1}{2} \mathbf{H}(X) + \tfrac{1}{2} \mathbf{H}(U) + O(1).$$

Of course, $\mathbf{H}(U) = \log|H + P|$. Applying (9), we conclude that

$$\mathbf{H}(X) = \log|H + P| + O(1)$$

and

$$\mathbf{H}(X + U) = \log|H + P| + O(1)$$

and thus

$$\sum_x p_{X+U}(x) \log \frac{1}{p_{X+U}(x)} = \log|H + P| + O(1).$$

On the other hand, if $s \in S$ and $x \in A_s$, one clearly has

$$p_{X+U}(x) \leq \frac{1}{|H + P|} \mathbf{P}(X \in s + (H + P) - 2(H + P))$$

and thus

$$\log \frac{1}{p_{X+U}(x)} \geq \log|H + P| + \log \frac{1}{\mathbf{P}(X + U \in s + 2(H + P) - 2(H + P))}.$$

Since $\sum_{s \in S} \sum_{x \in A_s} p_{X+U}(x) = 1$, we conclude that

$$\sum_{s \in S} \sum_{x \in A_s} p_{X+U}(x) \log \frac{1}{\mathbf{P}(X + U \in s + 2(H + P) - 2(H + P))} \leq O(1)$$

or equivalently

$$\sum_{s \in S} c_s \log \frac{1}{\mathbf{P}(X + U \in s + 2(H + P) - 2(H + P))} \leq O(1)$$

where $c_s := \mathbf{P}(X + U \in A_s)$. Observe that $s + 2(H + P) - 2(H + P)$ can be covered by at most $O(1)$ sets $A_{s'}$, where $s' \in s + 3(H + P) - 3(H + P)$. (Indeed, all such $A_{s'}$ are disjoint, have cardinality comparable to $|H + P|$, and are contained in $s + 4(H + P) - 4(H + P)$, which has cardinality $O(|H + P|)$.) Thus, by the pigeonhole principle, for every $s \in S$ there exists $s'(s) \in S \cap (s + 3(H + P) - 3(H + P))$ such that

$$\mathbf{P}(X + U \in s + 2(H + P) - 2(H + P)) \ll c_{s'(s)}$$

and thus

(38) $$\sum_{s \in S} c_s \log \frac{1}{c_{s'(s)}} \leq O(1).$$

Let $Y$ be the random variable $Y := s$, where $s$ is the unique $s \in S$ such that $X + U \in A_s$. Then $X - Y$ takes values in $(H + P) - 2(H + P)$. We now claim that

(39) $$\mathbf{H}(Y) \leq O(1),$$

or in other words that

(40) $$\sum_{s \in S} F(c_s) \leq O(1).$$

This is almost (38), but we have to replace $s'(s)$ by $s$. To do this, we let $C > e$ be a large quantity to be chosen later, and split the sum in (40) into three terms: one where $c_{s'(s)} \geq 1/e$, one where $1/e > c_{s'(s)} \geq C c_s$, and one where $c_{s'(s)} < C c_s$.

In the first case, observe that there are only $O(1)$ possible values of $s'(s)$; each one of these is associated to $O(1)$ possible values of $s$ (since $s \in s'(s) + 3(H + P) - 3(H + P)$ and the $s + H + P$ are disjoint), so the net contribution to (40) here is $O(1)$.

In the second case, we observe from (77) that

$$F(c_s) \leq 2 F(1/C) F(C c_s)$$

if $C c_s \leq 1/e$, while $C c_s > 1/e$ can occur at most $O(C)$ times, so the contribution of this term to (40) is at most

$$2 F(1/C) \sum_{s \in S} F(c_{s'(s)}) + O(C).$$

But each $s'$ can arise from at most $O(1)$ choices of $s$, so we can bound this contribution by at most

$$\tfrac{1}{2} \sum_{s' \in S} F(c_{s'}) + O(C) = \tfrac{1}{2} \mathbf{H}(Y) + O(1)$$

if $C = O(1)$ is chosen appropriately.

For the third case, we see that

$$F(c_s) \leq c_s \log \frac{1}{c_{s'(s)}} + c_s \log C$$

and so by (38) the net contribution of this case is

$$\leq O(1) + \log C = O(1).$$

Thus $\mathbf{H}(Y) \leq \tfrac{1}{2} \mathbf{H}(Y) + O(1)$, and the claim (39) follows. In particular

$$\mathrm{dist}_{tr}(X, X - Y) \ll 1.$$

But, since $\mathbf{H}(Y) = O(1)$, one has

$$\mathbf{H}(X - Y) \geq \mathbf{H}(X) - O(1) \geq \log|H + P| - O(1).$$

The random variable $X - Y$ ranges in $(H + P) - 2(H + P)$, which is a coset progression of rank $d$ and cardinality $O(|H + P|)$. Applying Corollary 4.6 to $X - Y$, we conclude that

$$\mathrm{dist}_{tr}(X - Y, U_{(H+P)-2(H+P)}) \ll 1$$

where $U_{(H+P)-2(H+P)}$ is the uniform distribution on $(H + P) - 2(H + P)$. Direct computation shows that

$$\mathrm{dist}_{tr}(U_{(H+P)-2(H+P)}, U) \ll 1$$

and the claim follows from another application of the triangle inequality. □

In view of the above proposition, it suffices to show that

Proposition 5.2. If $\sigma[X] \leq K$, then there exists a coset progression $H + P$ of rank $O_K(1)$ such that $\mathrm{dist}_R(X, U) \ll_K 1$, where $U$ is the uniform distribution on $H + P$.

Indeed, (ii) follows immediately from Proposition 5.2 and Proposition 5.1, while if we are in the situation of (iii), then from Theorem 1.10 one has

$$\mathrm{dist}_R(X, -X) \leq \mathrm{dist}_R(X, Y) + \mathrm{dist}_R(X, -Y) \leq 4\, \mathrm{dist}_R(X, Y) \leq 4 \log K$$

and thus we have $\sigma[X] \leq K^4$. By Proposition 5.2 one can then find a uniform distribution $U$ on a coset progression $H + P$ of rank $O_K(1)$ such that $\mathrm{dist}_R(X, U) \ll_K 1$, hence $\mathrm{dist}_{tr}(X, U) \ll_K 1$ by Proposition 5.1; meanwhile, from the Ruzsa triangle inequality (16) one has $\mathrm{dist}_R(Y, U) \ll_K 1$ and so $\mathrm{dist}_{tr}(Y, U) \ll_K 1$, and so by the triangle inequality one has $\mathrm{dist}_{tr}(X, Y) \ll_K 1$ as claimed.

It remains to establish Proposition 5.2. We begin with an approximate formula for $\mathbf{H}(X + Y) - \mathbf{H}(X)$ when $X, Y$ are independent.

Lemma 5.3 (Sumset entropy increase formula). Let $X, Y$ be independent. Then

$$\sum_{y \in \mathrm{range}(Y)} p_Y(y) \sum_{z \in \mathrm{range}(X+Y)} p_{X+y}(z) \log_+ \frac{p_{X+y}(z)}{p_{X+Y}(z)} = \mathbf{H}(X + Y) - \mathbf{H}(X) + O(1),$$

where $\log_+ x := \max(\log x, 0)$.

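Proof. To simplify the summation notation, it will be understood throughout that $y \in \mathrm{range}(Y)$ and $z \in \mathrm{range}(X + Y)$. We have

$$\mathbf{H}(X + Y) - \mathbf{H}(X) = \sum_y p_Y(y) \big( \mathbf{H}(X + Y) - \mathbf{H}(X + y) \big) = \sum_y p_Y(y) \sum_z \big( F(p_{X+Y}(z)) - F(p_{X+y}(z)) \big).$$

Since

$$\sum_y p_Y(y) \big( p_{X+Y}(z) - p_{X+y}(z) \big) = 0$$

for all $z$, we thus have

$$\mathbf{H}(X + Y) - \mathbf{H}(X) = \sum_y p_Y(y) \sum_z \Big( F(p_{X+Y}(z)) + F'(p_{X+Y}(z)) \big( p_{X+y}(z) - p_{X+Y}(z) \big) - F(p_{X+y}(z)) \Big).$$

From (76), the summand is equal to

$$p_{X+y}(z) \log_+ \frac{p_{X+y}(z)}{p_{X+Y}(z)} + O(p_{X+y}(z)) + O(p_{X+Y}(z)).$$

Since

$$\sum_y p_Y(y) \sum_z p_{X+y}(z) = \sum_y p_Y(y) \sum_z p_{X+Y}(z) = 1,$$

the desired claim follows. □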
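The formula of Lemma 5.3 can be tested on a small example. The sketch below assumes integer-valued variables and two hypothetical laws `p_x` and `p_y`; tracing the error terms in (76) suggests that the $O(1)$ term has absolute value at most 2, and that explicit constant is an assumption of this sketch rather than a statement from the paper.

```python
import math
from collections import defaultdict


def entropy(law):
    """Shannon entropy (natural logarithm) of a dict {value: probability}."""
    return sum(p * math.log(1.0 / p) for p in law.values() if p > 0)


def law_of_sum(p_x, p_y):
    """Law of X + Y for independent integer-valued X and Y."""
    out = defaultdict(float)
    for x, px in p_x.items():
        for y, py in p_y.items():
            out[x + y] += px * py
    return dict(out)


def sumset_increase_lhs(p_x, p_y):
    """Left-hand side of Lemma 5.3, using p_{X+y}(x + y) = p_X(x)."""
    p_sum = law_of_sum(p_x, p_y)
    total = 0.0
    for y, py in p_y.items():
        for x, px in p_x.items():
            ratio = px / p_sum[x + y]
            total += py * px * max(math.log(ratio), 0.0)
    return total


# Hypothetical example laws on the integers.
p_x = {x: 1.0 / 8.0 for x in range(8)}
p_y = {0: 0.5, 3: 0.25, 10: 0.25}

lhs = sumset_increase_lhs(p_x, p_y)
entropy_gap = entropy(law_of_sum(p_x, p_y)) - entropy(p_x)
```

Comparing `lhs` with `entropy_gap` on other choices of `p_x` and `p_y` gives a quick way to see how the bounded error behaves.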

This leads us to our first structural result on random variables of bounded doubling, namely that they are approximately uniformly distributed in a set that captures the bulk of the entropy.

Proposition 5.4 ($X$ is approximately uniformly distributed). If $\sigma[X] \leq K$, then there exists a non-empty set $A$ of cardinality

(41) $$|A| \asymp_K \exp(\mathbf{H}(X))$$

such that

(42) $$p_X(x) \asymp_K \exp(-\mathbf{H}(X))$$

for all $x \in A$.

Proof. We allow implied constants to depend on $K$. Write $Z := X_1 + X_2$ for the sum of two independent copies of $X$. From Lemma 5.3 we have

(43) $$\sum_y p_X(y) \sum_z p_{X+y}(z) \max\Big( \log \frac{p_{X+y}(z)}{p_Z(z)}, 1 \Big) \ll 1,$$

where it is understood that $y \in \mathrm{range}(X)$ and $z \in \mathrm{range}(Z)$.

Now let $0 < \varepsilon < 0.1$ be a small constant (depending on $K$) to be chosen later. From (43) we have

$$\sum_y p_X(y) \sum_{z : p_{X+y}(z) \geq e^{1/\varepsilon} p_Z(z)} p_{X+y}(z)/\varepsilon \ll 1$$

and thus

$$\sum_z \sum_{y : p_{X+y}(z) \geq e^{1/\varepsilon} p_Z(z)} p_X(y) p_{X+y}(z) \ll \varepsilon.$$

Swapping $y$ and $z - y$ (using the identity $p_{X+y}(z) = p_X(z - y)$) we also have

$$\sum_z \sum_{y : p_X(y) \geq e^{1/\varepsilon} p_Z(z)} p_X(y) p_{X+y}(z) \ll \varepsilon.$$

Also, observe that

$$\sum_z \sum_{y : p_{X+y}(z) \leq \varepsilon p_Z(z)} p_X(y) p_{X+y}(z) \leq \sum_z \varepsilon p_Z(z) \sum_y p_X(y) = \varepsilon$$

and similarly

$$\sum_z \sum_{y : p_X(y) \leq \varepsilon p_Z(z)} p_X(y) p_{X+y}(z) \leq \varepsilon.$$

Finally, we have

(44) $$\sum_z \sum_y p_X(y) p_{X+y}(z) = 1.$$

Putting all these estimates together (with $\varepsilon$ sufficiently small) we conclude that

(45) $$\sum_z \sum_{y : p_X(y), p_{X+y}(z) \asymp p_Z(z)} p_X(y) p_{X+y}(z) \geq 1/2.$$

From (44), (45), and the pigeonhole principle, there exists a $z_0 \in \mathrm{range}(Z)$ such that

$$\sum_{y : p_X(y), p_{X+y}(z_0) \asymp p_Z(z_0)} p_X(y) p_{X+y}(z_0) \geq p_Z(z_0)/2.$$

The left-hand side can be bounded crudely by

$$\ll |\{ y : p_X(y) \asymp p_Z(z_0) \}|\, p_Z(z_0)^2.$$

Thus if we let $A$ denote the set

(46) $$A := \{ y : p_X(y) \asymp p_Z(z_0) \}$$

then

$$|A| \gg 1/p_Z(z_0).$$

Since

$$1 \geq \sum_{y \in A} p_X(y) \gg |A|\, p_Z(z_0)$$

we conclude that

$$|A| \asymp 1/p_Z(z_0)$$

and hence by (46)

(47) $$p_X(y) \asymp 1/|A|$$

for all $y \in A$. In particular we have

(48) $$\mathbf{P}(X \in A) \asymp 1.$$

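To conclude the proof, we need to show that

$$\log|A| = \mathbf{H}(X) + O(1).$$

We may assume by a limiting argument that the events $X \in A$ and $X \notin A$ have non-zero probability. Let $X_1, X_2$ be independent copies of $X$, and let $Y$ be the indicator random variable $Y = 1_{X_1 \in A}$. Then by (81), (78) one has

(49) $$\mathbf{H}(X_1 + X_2) \geq \mathbf{H}(X_1 + X_2 | Y) = \mathbf{P}(X_1 \in A) \mathbf{H}(X_1 + X_2 | X_1 \in A) + \mathbf{P}(X_1 \notin A) \mathbf{H}(X_1 + X_2 | X_1 \notin A).$$

Now from Lemma 2.1 we have

$$\mathbf{H}(X_1 + X_2 | X_1 \notin A) \geq \mathbf{H}(X_1 | X_1 \notin A) = \mathbf{H}(X | X \notin A)$$

and

$$\mathbf{H}(X_1 + X_2 | X_1 \in A) \geq \mathbf{H}(X_2) = \mathbf{H}(X).$$

On the other hand, since $\sigma[X] \leq K$ by hypothesis, $\mathbf{H}(X_1 + X_2) \leq \mathbf{H}(X) + O(1)$. Putting all these estimates together, we obtain

$$\mathbf{H}(X) + O(1) \geq \mathbf{P}(X \in A) \mathbf{H}(X) + \mathbf{P}(X \notin A) \mathbf{H}(X | X \notin A)$$

and hence

(50) $$\mathbf{H}(X | X \notin A) \leq \mathbf{H}(X) + O(1/\mathbf{P}(X \notin A)).$$

A similar argument (swapping the roles of $A$ and its complement) gives

(51) $$\mathbf{H}(X | X \in A) \leq \mathbf{H}(X) + O(1/\mathbf{P}(X \in A)).$$

But from (47), (48) one has

(52) $$\mathbf{H}(X | X \in A) = \log|A| + O(1).$$

Combining this with (51), (48) we obtain the upper bound $\log|A| \leq \mathbf{H}(X) + O(1)$. Now we establish the lower bound. From (52), (50) one has

$$\mathbf{H}(X | Y) \leq \mathbf{P}(X \in A) \log|A| + \mathbf{P}(X \notin A) \mathbf{H}(X) + O(1).$$

Since $Y$ is boolean, we have $\mathbf{H}(Y) \leq \log 2$. In particular

$$\mathbf{H}(X | Y) = \mathbf{H}(X, Y) - \mathbf{H}(Y) \geq \mathbf{H}(X) - \log 2.$$

Combining this with the previous bound and (48) we see that

$$\log|A| \geq \mathbf{H}(X) - O\Big( \frac{1}{\mathbf{P}(X \in A)} \Big) \geq \mathbf{H}(X) - O(1)$$

as desired. □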
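Now we show that $A$ has large additive energy.

Proposition 5.5 ($A$ has large energy). Suppose that $\sigma[X] \leq K$, and let $A$ be the set in Proposition 5.4. Then $|\{ a_1, a_2, a_3, a_4 \in A : a_1 + a_2 = a_3 + a_4 \}| \gg_K |A|^3$.

Proof. Again, we allow implied constants to depend on $K$. Let $X_1, X_2$ be independent copies of $X$, and let $Y_1, Y_2$ be the indicators $Y_i = 1_{X_i \in A}$. We have

$$\begin{aligned}
\mathbf{H}(X) + O(1) &\geq \mathbf{H}(X_1 + X_2) \\
&\geq \mathbf{H}(X_1 + X_2 | Y_1, Y_2) \\
&= \mathbf{P}(X_1 \in A) \mathbf{P}(X_2 \in A) \mathbf{H}(X_1 + X_2 | X_1, X_2 \in A) \\
&\quad + \mathbf{P}(X_1 \in A) \mathbf{P}(X_2 \notin A) \mathbf{H}(X_1 + X_2 | X_1 \in A; X_2 \notin A) \\
&\quad + \mathbf{P}(X_1 \notin A) \mathbf{P}(X_2 \in A) \mathbf{H}(X_1 + X_2 | X_1 \notin A; X_2 \in A) \\
&\quad + \mathbf{P}(X_1 \notin A) \mathbf{P}(X_2 \notin A) \mathbf{H}(X_1 + X_2 | X_1, X_2 \notin A)
\end{aligned}$$

and

$$\begin{aligned}
\mathbf{H}(X) &\leq \tfrac{1}{2} \mathbf{H}(X_1, Y_1) + \tfrac{1}{2} \mathbf{H}(X_2, Y_2) \\
&\leq \tfrac{1}{2} \mathbf{H}(X_1 | Y_1) + \tfrac{1}{2} \mathbf{H}(X_2 | Y_2) + \log 2 \\
&= \mathbf{P}(X_1 \in A) \mathbf{P}(X_2 \in A) \big( \tfrac{1}{2} \mathbf{H}(X_1 | X_1 \in A) + \tfrac{1}{2} \mathbf{H}(X_2 | X_2 \in A) \big) \\
&\quad + \mathbf{P}(X_1 \in A) \mathbf{P}(X_2 \notin A) \big( \tfrac{1}{2} \mathbf{H}(X_1 | X_1 \in A) + \tfrac{1}{2} \mathbf{H}(X_2 | X_2 \notin A) \big) \\
&\quad + \mathbf{P}(X_1 \notin A) \mathbf{P}(X_2 \in A) \big( \tfrac{1}{2} \mathbf{H}(X_1 | X_1 \notin A) + \tfrac{1}{2} \mathbf{H}(X_2 | X_2 \in A) \big) \\
&\quad + \mathbf{P}(X_1 \notin A) \mathbf{P}(X_2 \notin A) \big( \tfrac{1}{2} \mathbf{H}(X_1 | X_1 \notin A) + \tfrac{1}{2} \mathbf{H}(X_2 | X_2 \notin A) \big) \\
&\quad + \log 2.
\end{aligned}$$

Now applying Lemma 2.1 we have

$$\mathbf{H}(X_1 + X_2 | X_1 \in A_1, X_2 \in A_2) \geq \tfrac{1}{2} \mathbf{H}(X_1 | X_1 \in A_1) + \tfrac{1}{2} \mathbf{H}(X_2 | X_2 \in A_2)$$

for any sets $A_1, A_2$. Inserting this into the first estimate and then subtracting from the second, we conclude in particular that

$$\mathbf{P}(X_1 \in A) \mathbf{P}(X_2 \in A) \big( \mathbf{H}(X_1 + X_2 | X_1, X_2 \in A) - \tfrac{1}{2} \mathbf{H}(X_1 | X_1 \in A) - \tfrac{1}{2} \mathbf{H}(X_2 | X_2 \in A) \big) \leq O(1)$$

and hence (by (48))

$$\mathbf{H}(X_1 + X_2 | X_1, X_2 \in A) \leq \mathbf{H}(X | X \in A) + O(1).$$

Let $X'$ be the random variable $X$ conditioned to the event $X \in A$. Then $X'$ now obeys the hypotheses of Proposition 5.4 (with $K$ replaced by a larger but still bounded quantity). Repeating the derivation of (45), we conclude

$$\sum_z \sum_{y : p_{X'}(y), p_{X'+y}(z) \asymp p_{Z'}(z)} p_{X'}(y) p_{X'+y}(z) \geq 1/2$$

where $Z'$ is the sum of two independent copies of $X'$. Observe that the summand here vanishes unless $y, z - y \in A$, in which case the summand is $\Theta(1/|A|^2)$. Setting $x := z - y$, we conclude that

(53) $$|\{ x, y \in A : p_{Z'}(x + y) \asymp 1/|A| \}| \gg |A|^2.$$

Since $p_{Z'}(z) \asymp |\{ a, a' \in A : a + a' = z \}| / |A|^2$, the claim follows. □

From Proposition 5.5 (or (53)) one can find $E \subset A \times A$ with $|E| \gg_K |A|^2$ such that $|A \stackrel{E}{+} A| \ll_K |A|$, where $A \stackrel{E}{+} A$ denotes the set of sums $a + b$ with $(a, b) \in E$. Applying the Balog-Szemerédi-Gowers theorem (Lemma 1.2) we conclude there exists a subset $A'$ of $A$ with $|A' + A'| \asymp_K |A'| \asymp_K |A|$. Applying Freiman's theorem in an arbitrary additive group (Theorem 1.1) we conclude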

Corollary 5.6 (Concentration on a coset progression). If $\sigma[X] \leq K$, then there exists a coset progression $H + P$ of rank $O_K(1)$ and cardinality

$$|H + P| \asymp_K \exp(\mathbf{H}(X))$$

such that

(54) $$p_X(x) \asymp_K \exp(-\mathbf{H}(X))$$

for $\gg_K |H + P|$ elements $x$ of $H + P$. □
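The additive energy appearing in Proposition 5.5 is easy to compute for small sets. The following is a minimal sketch, assuming finite sets of integers; the function name `additive_energy` and the two example sets are illustrative choices. It contrasts a structured set (an interval, with energy of order $|A|^3$) with an unstructured one (powers of two, with energy of order $|A|^2$).

```python
from collections import Counter


def additive_energy(a_set):
    """Number of quadruples (a1, a2, a3, a4) in A^4 with a1 + a2 = a3 + a4."""
    # r(z) counts ordered pairs (a, b) in A x A with a + b = z;
    # the energy is the sum of r(z)^2 over all z.
    representations = Counter(a + b for a in a_set for b in a_set)
    return sum(r * r for r in representations.values())


interval = list(range(10))               # arithmetic progression, |A| = 10
powers = [2 ** i for i in range(10)]     # lacunary set, |A| = 10

energy_interval = additive_energy(interval)  # (2n^3 + n) / 3 with n = 10
energy_powers = additive_energy(powers)      # 2n^2 - n with n = 10
```

For the interval the energy is 670, comparable to $|A|^3 = 1000$, whereas for the powers of two it is only 190, close to the trivial lower bound.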

Now we are ready to prove Proposition 5.2. Let $X_1, X_2$ be independent copies of $X$, let $Y_2$ be the indicator of the event $X_2 \in H + P$, and let $X'_2$ be the conditioning of $X_2$ to the event $X_2 \in H + P$. Let $U$ be a uniform distribution on $H + P$, taken to be independent of $X_1, X'_2$. From Corollary 5.6 we have

(55) $$\mathbf{P}(X_2 \in H + P) \asymp 1$$

and

(56) $$\mathbf{H}(X'_2) = \mathbf{H}(X) + O(1) = \log|H + P| + O(1).$$

(The lower bound on $\mathbf{H}(X'_2)$ follows from (54) and the definition of entropy; the upper bound follows from Jensen's inequality, Lemma A.1.)

Next, observe that

$$\begin{aligned}
\mathbf{H}(X_1 + X_2) &\geq \mathbf{H}(X_1 + X_2 | Y_2) \\
&= \mathbf{P}(X_2 \in H + P) \mathbf{H}(X_1 + X_2 | X_2 \in H + P) + \mathbf{P}(X_2 \notin H + P) \mathbf{H}(X_1 + X_2 | X_2 \notin H + P) \\
&\geq \mathbf{P}(X_2 \in H + P) \mathbf{H}(X_1 + X'_2) + \mathbf{P}(X_2 \notin H + P) \mathbf{H}(X_1) \\
&= \mathbf{H}(X_1) + \mathbf{P}(X_2 \in H + P) \big( \mathbf{H}(X_1 + X'_2) - \mathbf{H}(X_1) \big).
\end{aligned}$$

By the hypothesis $\sigma[X] \leq K$, one has $\mathbf{H}(X_1 + X_2) \leq \mathbf{H}(X_1) + O(1)$. Applying (55), one concludes that

$$\mathbf{H}(X_1 + X'_2) \leq \mathbf{H}(X_1) + O(1).$$

From this and (56) we see that $\mathrm{dist}_R(X_1, -X'_2) = O(1)$. Meanwhile, from Jensen's inequality one has

$$\mathbf{H}(X'_2 - U) \leq \log|(H + P) - (H + P)| \leq \log|H + P| + O(1),$$

which implies that $\mathrm{dist}_R(X'_2, U) = O(1)$. Applying the triangle inequality we obtain the claim.

The proof of Proposition 5.2, and thus Theorem 1.11, is now complete.

6. Proof of Theorem 1.13

We now prove Theorem 1.13. The basic idea is to get enough control on $X$ that one can find a "smooth" direction in which to approximate the discrete random variable by a continuous one.

Fix $\varepsilon$, and assume $X$ to be a $G$-random variable with $\mathbf{H}(X)$ sufficiently large depending on $\varepsilon$. We assume for contradiction that the claim failed, thus (after adjusting $\varepsilon$ slightly)

$$\mathbf{H}(X_1 + X_2) \leq \mathbf{H}(X) + \tfrac{1}{2} \log 2 - \varepsilon.$$

We can then apply Theorem 1.11(ii) and express $X = U + Z$, where $U$ is the uniform distribution in a coset progression $H + P$ of rank $O(1)$ and cardinality $O(\exp(\mathbf{H}(X)))$, and $\mathbf{H}(Z) = O(1)$. Since $G$ is torsion-free, the $H$ component of the coset progression is trivial, thus $U$ is just the uniform distribution on $P$.

Since $\mathbf{H}(Z) = O(1)$, we have

$$\sum_z p_Z(z) \log \frac{1}{p_Z(z)} = O(1).$$

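Let $0 < \delta < 1/2$ be a small number depending on $\varepsilon$ to be chosen later. Let $A := \{ z : p_Z(z) \geq \delta \}$, thus $|A| \leq 1/\delta$. Also, since

$$\sum_z p_Z(z) \log \frac{1}{p_Z(z)} \geq \Big( \log \frac{1}{\delta} \Big) \sum_{z \notin A} p_Z(z) = \Big( \log \frac{1}{\delta} \Big) \mathbf{P}(Z \notin A)$$

we see that

$$\mathbf{P}(Z \notin A) \ll \frac{1}{\log \frac{1}{\delta}}.$$

This implies that

$$\mathbf{H}(1_{Z \in A}) \ll \frac{\log \log \frac{1}{\delta}}{\log \frac{1}{\delta}}.$$

This implies from (85) that

$$\mathbf{H}(X | 1_{Z \in A}) \geq \mathbf{H}(X) - O\Big( \frac{\log \log \frac{1}{\delta}}{\log \frac{1}{\delta}} \Big)$$

and thus

(57) $$\mathbf{H}(X | Z \in A) \mathbf{P}(Z \in A) + \mathbf{H}(X | Z \notin A) \mathbf{P}(Z \notin A) \geq \mathbf{H}(X) - O\Big( \frac{\log \log \frac{1}{\delta}}{\log \frac{1}{\delta}} \Big).$$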

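If we let $X_1, Z_1$ and $X_2, Z_2$ be independent copies of $X, Z$, then we have

$$\begin{aligned}
\mathbf{H}(X_1 + X_2) &\geq \mathbf{H}(X_1 + X_2 | 1_{Z_1 \in A}, 1_{Z_2 \in A}) \\
&\geq \mathbf{H}(X_1 + X_2 | Z_1, Z_2 \in A) \mathbf{P}(Z \in A)^2 \\
&\quad + \mathbf{H}(X_1 + X_2 | Z_1 \in A; Z_2 \notin A) \mathbf{P}(Z \in A)(1 - \mathbf{P}(Z \in A)) \\
&\quad + \mathbf{H}(X_1 + X_2 | Z_2 \in A; Z_1 \notin A) \mathbf{P}(Z \in A)(1 - \mathbf{P}(Z \in A)) \\
&\quad + \mathbf{H}(X_1 + X_2 | Z_1, Z_2 \notin A)(1 - \mathbf{P}(Z \in A))^2.
\end{aligned}$$

From Lemma 2.1 one has

$$\begin{aligned}
\mathbf{H}(X_1 + X_2 | Z_2 \in A; Z_1 \notin A) &\geq \mathbf{H}(X | Z \in A) \\
\mathbf{H}(X_1 + X_2 | Z_1 \in A; Z_2 \notin A) &\geq \mathbf{H}(X | Z \notin A) \\
\mathbf{H}(X_1 + X_2 | Z_1, Z_2 \notin A) &\geq \mathbf{H}(X | Z \notin A)
\end{aligned}$$

so from (57) we see that

$$\mathbf{H}(X_1 + X_2) \geq \mathbf{H}(X) - O\Big( \frac{\log \log \frac{1}{\delta}}{\log \frac{1}{\delta}} \Big) + \big( \mathbf{H}(X_1 + X_2 | Z_1, Z_2 \in A) - \mathbf{H}(X | Z \in A) \big) \mathbf{P}(Z \in A)^2.$$

Thus, by taking $\delta$ small enough, it will suffice to show that

(58) $$\mathbf{H}(X'_1 + X'_2) \geq \mathbf{H}(X') + \tfrac{1}{2} \log 2 - \varepsilon/2$$

(say), where $X' := (X | Z \in A)$ and $X'_1, X'_2$ are independent copies of $X'$.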

Observe that $X'$ ranges in the set $A + P$; since $|A| \ll_\delta 1$, we may place $A + P$ inside a progression $Q$ of rank $O_\delta(1)$ and size $O_\delta(\exp(\mathbf{H}(X))) = O_\delta(|P|)$; by [15, Theorem 1.9], we may assume that $Q$ is 4-proper, thus

$$Q = \{ a + n_1 v_1 + \ldots + n_d v_d : n_1 \in [0, N_1), \ldots, n_d \in [0, N_d) \}$$

for some $d = O_\delta(1)$ and integers $N_1, \ldots, N_d$, and the sums $a + n_1 v_1 + \ldots + n_d v_d$ for $n_1 \in [0, 4N_1), \ldots, n_d \in [0, 4N_d)$ are all distinct. Using a Freiman isomorphism (see e.g. [12, Section 5.3]), we may thus identify $Q$ with the box $B := [0, N_1) \times \ldots \times [0, N_d)$ in $\mathbf{Z}^d$.

Let $X''$ be the counterpart of $X'$ in $B$, thus $X''$ is Freiman isomorphic to $X'$. Since $X' = (U + Z | Z \in A)$, with $U$ the uniform distribution on $P$, we see that

(59) $$p_{X''}(x) \ll_\delta 1/|P| \ll_\delta 1/|B|$$

for all $x \in B$.

Now we establish some "smoothness" in the probability distribution function $p_{X''_1 + X''_2}$ in some short direction, as measured using the total variation metric.

Lemma 6.1 (Smoothness of $p_{X''_1 + X''_2}$). Let $0 < \mu < 1$. Then, if $\mathbf{H}(X)$ is sufficiently large depending on $\mu, \delta$, there exists a non-zero $r \in [0, N_1) \times \ldots \times [0, N_d)$ with $|r| \ll_{\mu,\delta} 1$ such that

(60) $$\mathrm{dist}_{TV}(X''_1 + X''_2 + r, X''_1 + X''_2) \ll_\delta \mu.$$

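Proof. For this lemma it is convenient to embed $B$ and $X''$ inside the finite group

$$G' := \mathbf{Z}/3N_1\mathbf{Z} \times \ldots \times \mathbf{Z}/3N_d\mathbf{Z},$$

thus $p_{X''}$ is now a function on $G'$. The left-hand side of (60) can thus be written as

$$\sum_{x \in G'} |p_{X''} * p_{X''}(x + r) - p_{X''} * p_{X''}(x)|$$

where $p_{X''} * p_{X''}$ is the convolution

$$p_{X''} * p_{X''}(x) := \sum_{y \in G'} p_{X''}(y) p_{X''}(x - y).$$

We introduce the Fourier coefficients

$$\hat p_{X''}(\chi) := \sum_{x \in G'} p_{X''}(x) \overline{\chi(x)}$$

for all characters $\chi : G' \to S^1$ in the Pontryagin dual $\hat{G'}$ of $G'$. From Plancherel's theorem and (59) one has

(61) $$\sum_{\chi \in \hat{G'}} |\hat p_{X''}(\chi)|^2 = |G'| \sum_{x \in G'} |p_{X''}(x)|^2 \ll_\delta 1.$$

Thus, if we set

(62) $$\Lambda := \{ \chi \in \hat{G'} : |\hat p_{X''}(\chi)| \geq \mu \}$$

then

(63) $$|\Lambda| \ll_\delta \mu^{-2}.$$

If $\mathbf{H}(X)$ is large, then $N_1 \ldots N_d$ is large. By the Kronecker approximation theorem (see e.g. [12, Corollary 3.25]), if $\mathbf{H}(X)$ is large enough depending on $\delta, \mu$, we may thus find a non-zero $r \in [0, N_1) \times \ldots \times [0, N_d)$ with $|r| \ll_{\mu,\delta} 1$ such that

(64) $$|\chi(r) - 1| \leq \mu^2$$

for all $\chi \in \Lambda$.

Fix this $r$. From the Fourier inversion formula one has

$$p_{X''} * p_{X''}(x) = \frac{1}{|G'|} \sum_{\chi \in \hat{G'}} \hat p_{X''}(\chi)^2 \chi(x)$$

and thus

$$p_{X''} * p_{X''}(x + r) - p_{X''} * p_{X''}(x) = \frac{1}{|G'|} \sum_{\chi \in \hat{G'}} \hat p_{X''}(\chi)^2 (\chi(r) - 1) \chi(x).$$

By Plancherel's theorem, we conclude

$$\sum_{x \in G'} |p_{X''} * p_{X''}(x + r) - p_{X''} * p_{X''}(x)|^2 = \frac{1}{|G'|} \sum_{\chi \in \hat{G'}} |\hat p_{X''}(\chi)|^4 |\chi(r) - 1|^2.$$

From (63), (64) the contribution of the terms with $\chi \in \Lambda$ is $O_\delta(\mu^2/|G'|)$; by (62), (61), the contribution of the terms with $\chi \notin \Lambda$ is also $O_\delta(\mu^2/|G'|)$. We thus have

$$\sum_{x \in G'} |p_{X''} * p_{X''}(x + r) - p_{X''} * p_{X''}(x)|^2 \ll_\delta \mu^2/|G'|$$

and the claim follows from the Cauchy-Schwarz inequality. □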

Let $0 < \mu < 1$ be a small number (depending on $\delta, \varepsilon$) to be chosen later, and let $r$ be as in the above lemma. We can write $r = m r'$, where $m \geq 1$ is an integer with

(65) $$m \ll_\delta \mu^{-O_\delta(1)},$$

and $r'$ is irreducible in $\mathbf{Z}^d$. Applying yet another Freiman isomorphism, we may then map $B$ to the integers so that $r$ maps to $m$. If $X'''$ is the image of $X''$ under this isomorphism, then $X'''$ is isomorphic to $X'$, ranges over at most $|B|$ values, and

(66) $$\mathrm{dist}_{TV}(X'''_1 + X'''_2 + m, X'''_1 + X'''_2) \ll_\delta \mu.$$

Our task is now to show that

(67) $$\mathbf{H}(X'''_1 + X'''_2) \geq \mathbf{H}(X''') + \tfrac{1}{2} \log 2 - \varepsilon/2.$$

To motivate the general argument later, let us first consider the simpler irreducible case when $m = 1$, thus the distribution function $p_{X'''_1 + X'''_2}(x)$ looks "locally smooth". To exploit this, let $U$ be the continuous random variable uniformly distributed in $[0, 1]$, independent of $X'''$. Recall that the continuous Shannon entropy $\mathbf{H}_{\mathbf{R}}(V)$ of a random variable $V$ on $\mathbf{R}$ with distribution $p_V(x)\, dx$ is given by

$$\mathbf{H}_{\mathbf{R}}(V) := \int_{\mathbf{R}} F(p_V(x))\, dx.$$

A short computation then relates the continuous Shannon entropy to the discrete entropy:

$$\mathbf{H}_{\mathbf{R}}(X''' + U) = \mathbf{H}(X''').$$

Now let us look at the continuous variable $V := X'''_1 + U_1 + X'''_2 + U_2$. We write

$$\mathbf{H}_{\mathbf{R}}(V) = \log|P| + \int_{\mathbf{R}} F(p_V(x)) - p_V(x) \log|P|\, dx$$

where $p_V$ is the density function of $V$.

Observe that for any $x \in [n, n + 1]$, the density function $p_V(x)$ of $V$ at $x$ is equal to some average of $p_{X'''_1 + X'''_2}(n)$, $p_{X'''_1 + X'''_2}(n - 1)$, thus

$$p_V(x) = p_{X'''_1 + X'''_2}(n) + O(g(n))$$

where

$$g(n) := |p_{X'''_1 + X'''_2}(n) - p_{X'''_1 + X'''_2}(n - 1)|.$$

In particular $p_V(x) \ll_\delta 1/|P|$ by (59). Using the elementary estimate

$$F(b) - b \log|P| = F(a) - a \log|P| + O_\delta\Big( \Big( \frac{\mu}{|P|} + |b - a| \Big) \log \frac{1}{\mu} \Big)$$

when $0 \leq a, b \ll_\delta 1/|P|$ (which arises from the fact that $F'(c) = \log|P| + O_\delta(\log \frac{1}{\mu})$ for $\mu/|P| \leq c \ll_\delta 1/|P|$), we thus have

$$\mathbf{H}_{\mathbf{R}}(V) = \log|P| + \sum_n F(p_{X'''_1 + X'''_2}(n)) - p_{X'''_1 + X'''_2}(n) \log|P| + O_\delta\Big( \Big( \frac{\mu}{|P|} + g(n) \Big) \log \frac{1}{\mu} \Big).$$

From (66), $\sum_{n \in \mathrm{range}(X'''_1 + X'''_2) + \{0,1\}} (\frac{\mu}{|P|} + g(n)) \ll_\delta \mu$, and thus

$$\mathbf{H}_{\mathbf{R}}(V) = \mathbf{H}(X'''_1 + X'''_2) + O_\delta\Big( \mu \log \frac{1}{\mu} \Big).$$

On the other hand, from Shannon's inequality (20), we have

$$\mathbf{H}_{\mathbf{R}}(V) \geq \mathbf{H}_{\mathbf{R}}(X''' + U) + \tfrac{1}{2} \log 2.$$

Putting all this together, one obtains

$$\mathbf{H}(X'''_1 + X'''_2) \geq \mathbf{H}(X''') + \tfrac{1}{2} \log 2 - O_\delta\Big( \mu \log \frac{1}{\mu} \Big)$$

and the claim (67) follows by taking $\mu$ small enough.

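Now we return to the general case, when $m$ is not necessarily 1. We then introduce the random variable $W := X''' \bmod m \in \mathbf{Z}/m\mathbf{Z}$, and define $W_1, W_2$ analogously. Then

$$\mathbf{H}(X'''_1 + X'''_2) \geq \mathbf{H}(X'''_1 + X'''_2 | W_1).$$

Observe that $X'''_1 + X'''_2$ and $W_1$ determine $W_2$, and thus

$$\mathbf{H}(X'''_1 + X'''_2 | W_1) = \mathbf{H}(W_2) + \mathbf{H}(X'''_1 + X'''_2 | W_1, W_2).$$

We can write

$$\mathbf{H}(X'''_1 + X'''_2 | W_1, W_2) = \sum_{w_1, w_2 \in \mathbf{Z}/m\mathbf{Z}} p_{W_1}(w_1) p_{W_2}(w_2) \mathbf{H}(X_{1,w_1} + X_{2,w_2})$$

where for $i = 1, 2$, $X_{i,w_i}$ is the $\mathbf{Z}$-random variable $(X'''_i - W_i)/m$ conditioned to the event $W_i = w_i$. Meanwhile,

$$\mathbf{H}(X''') = \mathbf{H}(W_1) + \sum_{w_1 \in \mathbf{Z}/m\mathbf{Z}} p_{W_1}(w_1) \mathbf{H}(X_{1,w_1})$$

and similarly with the 1 index replaced by 2, thus

$$\mathbf{H}(X''') = \mathbf{H}(W_2) + \sum_{w_1, w_2 \in \mathbf{Z}/m\mathbf{Z}} p_{W_1}(w_1) p_{W_2}(w_2) \tfrac{1}{2} \big( \mathbf{H}(X_{1,w_1}) + \mathbf{H}(X_{2,w_2}) \big);$$

putting all this together, we see that to show (67), it will suffice to show that

(68) $$\sum_{w_1, w_2 \in \mathbf{Z}/m\mathbf{Z}} p_{W_1}(w_1) p_{W_2}(w_2) \Big[ \mathbf{H}(X_{1,w_1} + X_{2,w_2}) - \tfrac{1}{2} \big( \mathbf{H}(X_{1,w_1}) + \mathbf{H}(X_{2,w_2}) \big) \Big] \geq \tfrac{1}{2} \log 2 - \varepsilon/2.$$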

From Lemma 2.1 the expression in brackets is non-negative. Thus we may restrict the sum to a smaller range of $w_1, w_2$; more specifically, we will restrict to the range where

(69) $$p_{W_1}(w_1), p_{W_2}(w_2) \geq \mu/m.$$

On this range, we have from (59), (65) that

$$p_{X_{i,w_i}}(x) \ll_\delta \mu^{-O_\delta(1)}/|P|$$

for all $i = 1, 2$ and $x \in \mathbf{Z}$; also observe that $p_{X_{i,w_i}}$ takes on at most $|P|$ values. From (66) we have

(70) $$\sum_{w_i \in \mathbf{Z}/m\mathbf{Z}} p_{W_i}(w_i)\, \mathrm{dist}_{TV}(X_{i,w_i} + 1, X_{i,w_i}) \ll_\delta \mu$$

for $i = 1, 2$. We will now restrict $w_1, w_2$ further, by imposing the additional restriction

(71) $$\mathrm{dist}_{TV}(X_{i,w_i} + 1, X_{i,w_i}) \leq \mu^{1/2}$$

for $i = 1, 2$.

Now we repeat the arguments from the $m = 1$ case. Let $U_1, U_2$ be independent copies of the uniform distribution on $[0, 1]$; then as before we have

$$\mathbf{H}_{\mathbf{R}}(X_{i,w_i} + U_i) = \mathbf{H}(X_{i,w_i})$$

and

$$\mathbf{H}_{\mathbf{R}}(X_{1,w_1} + U_1 + X_{2,w_2} + U_2) = \mathbf{H}(X_{1,w_1} + X_{2,w_2}) + O_\delta\Big( \mu^{1/2} \log \frac{1}{\mu} \Big);$$

applying (20), we conclude

$$\mathbf{H}(X_{1,w_1} + X_{2,w_2}) - \tfrac{1}{2} \big( \mathbf{H}(X_{1,w_1}) + \mathbf{H}(X_{2,w_2}) \big) \geq \tfrac{1}{2} \log 2 - O_\delta\Big( \mu^{1/2} \log \frac{1}{\mu} \Big).$$

To conclude the proof of (68), it thus suffices (on taking $\mu$ small enough) to show that

$$\sum_{w_1, w_2} p_{W_1}(w_1) p_{W_2}(w_2) \geq 1 - \varepsilon/4$$

(say), where $w_1, w_2$ range over all pairs in $\mathbf{Z}/m\mathbf{Z}$ obeying (69), (71). But the contribution of those $w_1, w_2$ that fail to obey (69) is $O(\mu)$, while from (70) the contribution of the $w_1, w_2$ that fail to obey (71) is $O_\delta(\mu^{1/2})$, and the claim follows by taking $\mu$ small enough.
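The lower bound $\mathbf{H}(X_1 + X_2) \geq \mathbf{H}(X) + \frac{1}{2} \log 2 - o(1)$ in the torsion-free case can be explored numerically. The sketch below assumes $G = \mathbf{Z}$ and two hypothetical families: the uniform distribution on an interval, and a truncated discretised Gaussian (which is expected to come close to the extremal constant). The helper names, the scale `s = 20` and the truncation radius are illustrative choices.

```python
import math
from collections import defaultdict


def entropy(law):
    """Shannon entropy (natural logarithm) of a dict {value: probability}."""
    return sum(p * math.log(1.0 / p) for p in law.values() if p > 0)


def doubling_gap(law):
    """H(X_1 + X_2) - H(X) for independent copies X_1, X_2 of X on the integers."""
    out = defaultdict(float)
    for a, pa in law.items():
        for b, pb in law.items():
            out[a + b] += pa * pb
    return entropy(out) - entropy(law)


def uniform_interval(n):
    """Uniform distribution on {0, ..., n - 1}."""
    return {x: 1.0 / n for x in range(n)}


def discrete_gaussian(s, cutoff):
    """Gaussian weights exp(-x^2 / (2 s^2)) on |x| <= cutoff, normalised."""
    weights = {x: math.exp(-x * x / (2.0 * s * s)) for x in range(-cutoff, cutoff + 1)}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}


half_log_2 = 0.5 * math.log(2.0)
gap_uniform = doubling_gap(uniform_interval(200))
gap_gauss = doubling_gap(discrete_gaussian(20.0, 200))
```

In this experiment the uniform interval gives a gap close to $1/2$, comfortably above $\frac{1}{2} \log 2 \approx 0.3466$, while the discretised Gaussian gives a gap very close to $\frac{1}{2} \log 2$ itself.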

Appendix A. Basic properties of entropy

The function $F(x) := x \log \frac{1}{x}$ introduced earlier has first derivative

$$F'(x) = \log \frac{1}{x} - 1$$

and second derivative

$$F''(x) = -\frac{1}{x}$$

for $x > 0$; from this one easily concludes that $F$ is concave on $\mathbf{R}^+$, and increasing for $x < 1/e$. In particular, we have the upper bound

(72) $$F(x) \leq F(1/e) = 1/e,$$

the inequality

(73) $$F(y) \leq F(x) + F'(x)(y - x)$$

for all $y \geq 0$ and $x > 0$, and the subadditivity property

(74) $$F(x + y) \leq F(x) + F(y)$$

for all $x, y \geq 0$. In particular we have the triangle inequality

(75) $$|F(a) - F(b)| \leq F(|a - b|)$$

for $0 \leq a, b \leq 1/e$. From the identity

$$F(x) + F'(x)(y - x) - F(y) = y \Big( \frac{x}{y} - 1 - \log \frac{x}{y} \Big)$$

and (73) we obtain the bound

(76) $$F(x) + F'(x)(y - x) - F(y) = y \log_+ \frac{y}{x} + O(x) + O(y)$$

where $\log_+ x := \max(\log x, 0)$. Finally, from the identity

$$F(ax) = F(a) F(x) \Big( \frac{1}{\log \frac{1}{a}} + \frac{1}{\log \frac{1}{x}} \Big)$$

we see that

(77) $$F(ax) \leq 2 F(a) F(x)$$

whenever $0 < a, x \leq 1/e$.
Lemma A.1 (Jensen bound). Let $A$ be a finite set, and let $X$ be an $A$-random variable. Then $\mathbf{H}(X) \leq \log|A|$. Furthermore, if

$$\mathbf{H}(X) \geq \log|A| - \log K$$

for some $K \geq 1$, then

$$\sum_{k=1}^\infty 2^k \mathbf{P}(X \in A_k) \ll 1 + \log K$$

where

$$A_k := \{ x \in A : 2^{2^{k-1}} \leq p_X(x) |A| < 2^{2^k} \}.$$

Proof. For the first bound, we observe from (73) that

$$\mathbf{H}(X) = \sum_{x \in A} F(p_X(x)) \leq \sum_{x \in A} F\Big( \frac{1}{|A|} \Big) + F'\Big( \frac{1}{|A|} \Big) \Big( p_X(x) - \frac{1}{|A|} \Big) = \log|A|$$

as required. Similarly, if $\mathbf{H}(X) \geq \log|A| - \log K$, then the above argument shows that

$$\sum_{x \in A} F\Big( \frac{1}{|A|} \Big) + F'\Big( \frac{1}{|A|} \Big) \Big( p_X(x) - \frac{1}{|A|} \Big) - F(p_X(x)) \leq \log K.$$

From (73), the summand is non-negative; from (76), the summand is $p_X(x) \log(|A| p_X(x)) - O(p_X(x))$ for $p_X(x) \geq 1/|A|$, and the claim follows by decomposing the $x$ variable into the sets $A_k$. □

Let $X$ be a discrete random variable, and let $E$ be an event which occurs with positive probability. Then we can define the conditioned random variable $(X|E)$ by restricting the underlying probability measure to $E$ (and then dividing out by $\mathbf{P}(E)$ to recover the normalisation), thus

$$p_{(X|E)}(x) = \mathbf{P}((X = x) \wedge E)/\mathbf{P}(E).$$

In the special case where $E$ is an event of the form $X \in A$ for some set $A$, we conclude that

$$p_{(X|X \in A)}(x) = \frac{1_A(x) p_X(x)}{\mathbf{P}(X \in A)}.$$

Given two random variables $X, Y$ (not necessarily independent), we define the conditional entropy $\mathbf{H}(X|Y)$ by the formula

(78) $$\mathbf{H}(X|Y) := \sum_{y \in \mathrm{range}(Y)} p_Y(y) \mathbf{H}(X | Y = y).$$

A standard computation reveals the identity

(79) $$\mathbf{H}(X|Y) = \mathbf{H}(X, Y) - \mathbf{H}(Y),$$

and in particular

(80) $$\mathbf{H}(X|Y) = \mathbf{H}(X, Y | Y).$$

Meanwhile, one has the total probability formula

$$p_X(x) = \sum_{y \in \mathrm{range}(Y)} p_Y(y) p_{(X|Y=y)}(x);$$

comparing this with (78) and Jensen's inequality using the concavity of $F$ we conclude that

(81) $$\mathbf{H}(X|Y) \leq \mathbf{H}(X)$$

with equality if and only if $(X | Y = y) \equiv X$ for all $y \in \mathrm{range}(Y)$, or in other words if $X$ and $Y$ are independent. From this and (79) we conclude that

(82) $$\mathbf{H}(X, Y) \leq \mathbf{H}(X) + \mathbf{H}(Y).$$

We say that a discrete random variable $Y$ is determined by another discrete random variable $X$ if one has $Y = f(X)$ for some function $f : \mathrm{range}(X) \to \mathrm{range}(Y)$. From the subadditivity property (74) we see that

(83) $$\mathbf{H}(Y) \leq \mathbf{H}(X)$$

whenever $X$ determines $Y$. For instance, since $(X, Y)$ determines both $X$ and $Y$,

(84) $$\mathbf{H}(X), \mathbf{H}(Y) \leq \mathbf{H}(X, Y),$$

and hence

(85) $$\mathbf{H}(X) - \mathbf{H}(Y) \leq \mathbf{H}(X|Y) \leq \mathbf{H}(X).$$

If $X$ determines $Y$, then $X$ and $(X, Y)$ determine each other, and so $\mathbf{H}(X, Y) = \mathbf{H}(X)$; in particular,

(86) $$\mathbf{H}(X|Y) = \mathbf{H}(X) - \mathbf{H}(Y)$$

and $\mathbf{H}(Y|X) = 0$.
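The identities (79), (81), (82) and (85) can be verified directly on a small joint distribution. The sketch below assumes a hypothetical joint law of a correlated pair of bits $(X, Y)$; the helper names are illustrative and entropies use natural logarithms.

```python
import math
from collections import defaultdict


def entropy(law):
    """Shannon entropy (natural logarithm) of a dict {value: probability}."""
    return sum(p * math.log(1.0 / p) for p in law.values() if p > 0)


def marginal(joint, index):
    """Marginal law of the coordinate `index` of a joint law on pairs."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[outcome[index]] += p
    return dict(out)


def conditional_entropy(joint):
    """H(X|Y) computed from definition (78) for a joint law of (X, Y)."""
    p_y = marginal(joint, 1)
    total = 0.0
    for y, py in p_y.items():
        conditioned = {x: p / py for (x, y2), p in joint.items() if y2 == y}
        total += py * entropy(conditioned)
    return total


# Hypothetical joint law of two positively correlated bits.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
h_x = entropy(marginal(joint, 0))
h_y = entropy(marginal(joint, 1))
h_xy = entropy(joint)
h_x_given_y = conditional_entropy(joint)
```

Because the two bits are correlated, `h_x_given_y` is strictly smaller than `h_x` here; replacing `joint` by a product law makes (81) an equality.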

We have the following useful inequality:

Lemma A.2 (Submodularity inequality). If $X_0, X_1, X_2, X_{12}$ are random variables such that $X_1$ and $X_2$ each determine $X_0$, and $(X_1, X_2)$ determine $X_{12}$, then

$$\mathbf{H}(X_{12}) + \mathbf{H}(X_0) \leq \mathbf{H}(X_1) + \mathbf{H}(X_2).$$

Proof. By (85) and (86) it suffices to show that

$$\mathbf{H}(X_{12} | X_0) \leq \mathbf{H}(X_1 | X_0) + \mathbf{H}(X_2 | X_0).$$

By (78) it suffices to show that

$$\mathbf{H}(X_{12} | X_0 = x_0) \leq \mathbf{H}(X_1 | X_0 = x_0) + \mathbf{H}(X_2 | X_0 = x_0)$$

for all $x_0 \in \mathrm{range}(X_0)$. But by hypothesis, $(X_1 | X_0 = x_0)$ and $(X_2 | X_0 = x_0)$ determine $(X_{12} | X_0 = x_0)$, and the claim then follows from (83) and (82). □
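The submodularity inequality can be illustrated with three independent fair bits. In the sketch below, which is only an example under assumed choices, $X_1 = (a, b)$ and $X_2 = (b, c)$ each determine $X_0 = b$, and together they determine $X_{12} = a + b + c$; the helper `entropy_of` is an illustrative name.

```python
import math
from collections import defaultdict
from itertools import product


def entropy_of(function, outcomes):
    """Entropy of function(w) when w is uniform over the list `outcomes`."""
    law = defaultdict(float)
    for w in outcomes:
        law[function(w)] += 1.0 / len(outcomes)
    return sum(p * math.log(1.0 / p) for p in law.values())


# Sample space: three independent fair bits w = (a, b, c).
outcomes = list(product([0, 1], repeat=3))

h_1 = entropy_of(lambda w: (w[0], w[1]), outcomes)       # X_1 = (a, b)
h_2 = entropy_of(lambda w: (w[1], w[2]), outcomes)       # X_2 = (b, c)
h_0 = entropy_of(lambda w: w[1], outcomes)               # X_0 = b
h_12 = entropy_of(lambda w: w[0] + w[1] + w[2], outcomes)  # X_12 = a + b + c
h_all = entropy_of(lambda w: w, outcomes)                # X_12 = (a, b, c)
```

With $X_{12} = a + b + c$ the inequality is strict, while with $X_{12} = (a, b, c)$ it becomes an equality, since then $X_1$ and $X_2$ are conditionally independent given $X_0$.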

As a special case of Lemma A.2 (and (80)) we see that

(87) $$\mathbf{H}(Y|Z) \leq \mathbf{H}(X|Z)$$

whenever $(X, Z)$ determines $Y$. Similarly, we have

(88) $$\mathbf{H}(X, Y | Z) \leq \mathbf{H}(X|Z) + \mathbf{H}(Y|Z)$$

for any $X, Y, Z$, with equality if and only if $(X | Z = z)$ and $(Y | Z = z)$ are independent for all $z \in \mathrm{range}(Z)$, i.e. if $X$ and $Y$ are conditionally independent relative to $Z$.